The City of Melbourne is seeking to develop a comprehensive tool that can aid urban planners, policymakers, and business owners in fostering economic growth and supporting local businesses. To achieve this, they require a deep understanding of the city's business landscape, including the location, industry classification, and job creation of businesses within the city.
This use case focuses on creating an interactive platform that integrates multiple datasets to provide a holistic view of the business environment in the City of Melbourne. The core components of this project include street data providing geographic coordinates of property addresses; business establishment data covering the period 2002 to 2022, with information on business address, industry classification (ANZSIC4), location, CLUE small area, and CLUE industry classification; and business establishments and jobs data offering insights into the number of jobs and business establishments.
At the end of this use case you will:
As an urban planner, policymaker, or business owner in the City of Melbourne, I want to have a comprehensive understanding of the city's business landscape, so that I can make informed decisions, support existing businesses, and foster economic growth in the region.
Specifically, I need to be able to:
By having access to a comprehensive and integrated platform that combines various datasets on the city's businesses, I will be able to make data-driven decisions, identify areas of opportunity, and implement policies and initiatives that foster a vibrant and prosperous business community in the City of Melbourne.
This project aims to create a robust and interactive platform that integrates multiple datasets to provide a holistic view of the business environment in the City of Melbourne. The key components of the project include street data, business establishment data from 2002 to 2022, and job creation data. By combining these datasets, the platform will provide urban planners, policymakers, and business owners with a powerful tool that supports informed decision-making, fosters economic growth, and enhances the overall business environment within the City of Melbourne.
The datasets required for this project include:
This project involves a series of steps, including data cleaning, data exploration, data preprocessing, exploratory data analysis, and predictive modeling. By leveraging these techniques, the project aims to uncover patterns, trends, and relationships within the City of Melbourne's business landscape, ultimately aiding informed decision-making, fostering economic growth, and enhancing the overall business environment.
Importing the required Python modules
import warnings
warnings.filterwarnings("ignore")
import requests
import numpy as np
import pandas as pd
import seaborn as sns
from io import StringIO
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import Point
import holoviews as hv
import folium
from folium.plugins import MarkerCluster, HeatMap
from wordcloud import WordCloud
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.metrics import classification_report, confusion_matrix, mean_squared_error, mean_absolute_error, r2_score
import plotly.express as px
from sklearn.linear_model import LogisticRegression, LinearRegression
from shapely import wkt
from IPython.display import display, HTML
from scipy.spatial import distance
from scipy.stats import gaussian_kde
import contextily as ctx
from statsmodels.tsa.stattools import acf, pacf
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing
from statsmodels.tsa.api import VAR
from math import sqrt
from pmdarima import auto_arima
from statsmodels.tsa.stattools import adfuller
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
# Street address
base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id = 'street-addresses'
url = f'{base_url}{dataset_id}/exports/csv'
params = {'select': '*', 'limit': -1, 'lang': 'en', 'timezone': 'UTC'}
response = requests.get(url, params=params)
if response.status_code == 200:
    url_content = response.content.decode('utf-8')
    street_add = pd.read_csv(StringIO(url_content), delimiter=';')  # naming the dataset street_add
    print(street_add.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
geo_point_2d \
0 -37.802381557572, 144.941473440919
1 -37.816860132435, 144.969991449806
2 -37.798830275265, 144.942872100233
3 -37.810546771396, 144.970906397029
4 -37.789293657657, 144.939794028368
5 -37.800782747366, 144.951363910142
6 -37.826775193662, 144.959358160779
7 -37.810925617843, 144.965443591832
8 -37.81275302482, 144.964263891172
9 -37.821209833971, 144.95403377339
geo_shape suburb_id latitude \
0 {"coordinates": [144.941473440919, -37.8023815... 592.0 -37.802382
1 {"coordinates": [144.969991449806, -37.8168601... 591.0 -37.816860
2 {"coordinates": [144.942872100233, -37.7988302... 592.0 -37.798830
3 {"coordinates": [144.970906397029, -37.8105467... 591.0 -37.810547
4 {"coordinates": [144.939794028368, -37.7892936... 592.0 -37.789294
5 {"coordinates": [144.951363910142, -37.8007827... 592.0 -37.800783
6 {"coordinates": [144.959358160779, -37.8267751... 596.0 -37.826775
7 {"coordinates": [144.965443591832, -37.8109256... 591.0 -37.810926
8 {"coordinates": [144.964263891172, -37.8127530... 591.0 -37.812753
9 {"coordinates": [144.95403377339, -37.82120983... 599.0 -37.821210
street_no str_name address_pnt \
0 133 Laurens Street 133 Laurens Street North Melbourne
1 129 Flinders Street 129 Flinders Street Melbourne
2 44 Macaulay Road 44 Macaulay Road North Melbourne
3 13 Punch Lane 13 Punch Lane Melbourne
4 61 Racecourse Road 61 Racecourse Road North Melbourne
5 133 Leveson Street 133 Leveson Street North Melbourne
6 263 City Road 263 City Road Southbank
7 3 Albert Coates Lane 3 Albert Coates Lane Melbourne
8 280 Little Bourke Street 280 Little Bourke Street Melbourne
9 612 Flinders Street 612 Flinders Street Docklands
easting northing gisid longitude suburb street_id \
0 318773.161972 5.814115e+06 48531 144.941473 North Melbourne 781
1 321318.983104 5.812563e+06 37711 144.969991 Melbourne 636
2 318887.633593 5.814512e+06 30476 144.942872 North Melbourne 847
3 321384.307636 5.813265e+06 35165 144.970906 Melbourne 1003
4 318593.277492 5.815564e+06 22247 144.939794 North Melbourne 119624
5 319640.094405 5.814311e+06 54777 144.951364 North Melbourne 791
6 320406.983288 5.811443e+06 46380 144.959358 Southbank 1345
7 320904.296420 5.813213e+06 14049 144.965444 Melbourne 117923
8 320804.859001 5.813008e+06 28925 144.964264 Melbourne 809
9 319924.815030 5.812050e+06 1278 144.954034 Docklands 636
add_comp
0 NaN
1 NaN
2 NaN
3 NaN
4 NaN
5 NaN
6 NaN
7 NaN
8 NaN
9 NaN
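The request/decode/parse pattern above is repeated verbatim for every dataset pulled below, so it can be factored into a small helper. This is a sketch: `fetch_dataset` is a name introduced here, and it assumes the same Open Data API CSV export endpoint used above.

```python
import requests
import pandas as pd
from io import StringIO

BASE_URL = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'

def fetch_dataset(dataset_id, delimiter=';'):
    """Download a CSV export from the Melbourne Open Data API into a DataFrame."""
    url = f'{BASE_URL}{dataset_id}/exports/csv'
    params = {'select': '*', 'limit': -1, 'lang': 'en', 'timezone': 'UTC'}
    response = requests.get(url, params=params)
    # Raise an exception for non-200 responses instead of silently printing
    response.raise_for_status()
    return pd.read_csv(StringIO(response.content.decode('utf-8')), delimiter=delimiter)

# Usage (network call, so left commented):
# street_add = fetch_dataset('street-addresses')
```

Raising on failure (rather than printing and continuing) stops the notebook early when a download breaks, before later cells fail on an undefined variable.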
# Business establishments and jobs data by business size and industry
base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id = 'business-establishments-and-jobs-data-by-business-size-and-industry'
url = f'{base_url}{dataset_id}/exports/csv'
params = {'select': '*', 'limit': -1, 'lang': 'en', 'timezone': 'UTC'}
response = requests.get(url, params=params)
if response.status_code == 200:
    url_content = response.content.decode('utf-8')
    # Naming the dataset industrysize_jobs
    industrysize_jobs = pd.read_csv(StringIO(url_content), delimiter=';')
    print(industrysize_jobs.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
census_year clue_small_area \
0 2015 West Melbourne (Residential)
1 2015 West Melbourne (Residential)
2 2015 West Melbourne (Residential)
3 2015 West Melbourne (Residential)
4 2015 West Melbourne (Residential)
5 2015 West Melbourne (Residential)
6 2014 Carlton
7 2014 Carlton
8 2014 Carlton
9 2014 Carlton
anzsic_indusrty \
0 Health Care and Social Assistance
1 Manufacturing
2 Manufacturing
3 Professional, Scientific and Technical Services
4 Rental, Hiring and Real Estate Services
5 Wholesale Trade
6 Administrative and Support Services
7 Construction
8 Financial and Insurance Services
9 Financial and Insurance Services
clue_industry business_size total_establishments \
0 Health Care and Social Assistance Large business 1
1 Manufacturing Medium business 5
2 Manufacturing Non employing 1
3 Business Services Non employing 3
4 Real Estate Services Small business 5
5 Wholesale Trade Non employing 2
6 Admin and Support Services Medium business 7
7 Construction Small business 11
8 Finance and Insurance Medium business 5
9 Finance and Insurance Small business 35
total_jobs
0 NaN
1 171.0
2 NaN
3 0.0
4 42.0
5 NaN
6 209.0
7 55.0
8 315.0
9 216.0
# Business establishments with address and industry classification
base_url = 'https://data.melbourne.vic.gov.au/api/explore/v2.1/catalog/datasets/'
dataset_id = 'business-establishments-with-address-and-industry-classification'
url = f'{base_url}{dataset_id}/exports/csv'
params = {'select': '*', 'limit': -1, 'lang': 'en', 'timezone': 'UTC'}
response = requests.get(url, params=params)
if response.status_code == 200:
    url_content = response.content.decode('utf-8')
    # Naming the dataset industryinfo_classification
    industryinfo_classification = pd.read_csv(StringIO(url_content), delimiter=';')
    print(industryinfo_classification.head(10))
else:
    print(f'Request failed with status code {response.status_code}')
census_year block_id property_id base_property_id clue_small_area \
0 2017 266 109851 109851 Carlton
1 2017 266 109851 109851 Carlton
2 2017 266 534003 534003 Carlton
3 2017 266 664003 664003 Carlton
4 2017 266 664005 664005 Carlton
5 2017 266 664009 664009 Carlton
6 2017 267 102614 102614 Carlton
7 2017 267 102614 102614 Carlton
8 2017 267 106251 106251 Carlton
9 2017 267 106251 106251 Carlton
trading_name \
0 Metropoli's Research Pty Ltd
1 J Hong Restaurant
2 St2 Expresso
3 RMIT Resources Ltd
4 vacant
5 RMIT Resources Ltd Bldg 42
6 J Trioli & Co
7 Marie August
8 Downtowner On Lygon
9 Mellow's Restaurant and Bar
business_address industry_anzsic4_code \
0 Level 1, 74 Victoria Street CARLTON 3053 6950
1 Ground , 74 Victoria Street CARLTON 3053 4511
2 70 Victoria Street CARLTON 3053 4512
3 20 Cardigan Street CARLTON 3053 8102
4 24 Cardigan Street CARLTON 3053 0
5 36-36 Earl Street CARLTON 3053 8102
6 Level 1, 33 Drummond Street CARLTON 3053 6932
7 Part Ground , 33 Drummond Street CARLTON 3053 8512
8 66 Lygon Street CARLTON 3053 4400
9 Part Ground , 66 Lygon Street CARLTON 3053 4511
industry_anzsic4_description longitude latitude
0 Market Research and Statistical Services 144.965352 -37.806701
1 Cafes and Restaurants 144.965352 -37.806701
2 Takeaway Food Services 144.965473 -37.806714
3 Higher Education 144.964753 -37.806312
4 Vacant Space 144.964772 -37.806203
5 Higher Education 144.964974 -37.805887
6 Accounting Services 144.967047 -37.806081
7 Specialist Medical Services 144.967047 -37.806081
8 Accommodation 144.966575 -37.805303
9 Cafes and Restaurants 144.966575 -37.805303
To understand and gain familiarity with our datasets, we first examine their structure by checking the number of rows and columns in each. This reveals the size and complexity of the data and how the different variables are stored and formatted. It also highlights which columns may require further cleaning or transformation, giving us early insight into the nature of the data we'll be working with.
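The structure check described above can be done in one loop over the loaded DataFrames. A sketch (`summarise` is a name introduced here; a tiny synthetic frame stands in for the real data):

```python
import pandas as pd

def summarise(frames):
    """Print row/column counts and a dtype tally for each named DataFrame."""
    for name, df in frames.items():
        print(f'{name}: {df.shape[0]} rows x {df.shape[1]} columns')
        print(df.dtypes.value_counts().to_dict())

# With the real data this would be, e.g.:
# summarise({'street_add': street_add, 'industrysize_jobs': industrysize_jobs,
#            'industryinfo_classification': industryinfo_classification})
demo = pd.DataFrame({'a': [1, 2], 'b': ['x', 'y']})
summarise({'demo': demo})
```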
#Street address data set
street_add.head()
| geo_point_2d | geo_shape | suburb_id | latitude | street_no | str_name | address_pnt | easting | northing | gisid | longitude | suburb | street_id | add_comp | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -37.802381557572, 144.941473440919 | {"coordinates": [144.941473440919, -37.8023815... | 592.0 | -37.802382 | 133 | Laurens Street | 133 Laurens Street North Melbourne | 318773.161972 | 5.814115e+06 | 48531 | 144.941473 | North Melbourne | 781 | NaN |
| 1 | -37.816860132435, 144.969991449806 | {"coordinates": [144.969991449806, -37.8168601... | 591.0 | -37.816860 | 129 | Flinders Street | 129 Flinders Street Melbourne | 321318.983104 | 5.812563e+06 | 37711 | 144.969991 | Melbourne | 636 | NaN |
| 2 | -37.798830275265, 144.942872100233 | {"coordinates": [144.942872100233, -37.7988302... | 592.0 | -37.798830 | 44 | Macaulay Road | 44 Macaulay Road North Melbourne | 318887.633593 | 5.814512e+06 | 30476 | 144.942872 | North Melbourne | 847 | NaN |
| 3 | -37.810546771396, 144.970906397029 | {"coordinates": [144.970906397029, -37.8105467... | 591.0 | -37.810547 | 13 | Punch Lane | 13 Punch Lane Melbourne | 321384.307636 | 5.813265e+06 | 35165 | 144.970906 | Melbourne | 1003 | NaN |
| 4 | -37.789293657657, 144.939794028368 | {"coordinates": [144.939794028368, -37.7892936... | 592.0 | -37.789294 | 61 | Racecourse Road | 61 Racecourse Road North Melbourne | 318593.277492 | 5.815564e+06 | 22247 | 144.939794 | North Melbourne | 119624 | NaN |
# Street addresses dataset datatype information
street_add.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63721 entries, 0 to 63720
Data columns (total 14 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   geo_point_2d  63721 non-null  object
 1   geo_shape     63721 non-null  object
 2   suburb_id     63715 non-null  float64
 3   latitude      63721 non-null  float64
 4   street_no     63069 non-null  object
 5   str_name      63720 non-null  object
 6   address_pnt   63721 non-null  object
 7   easting       63721 non-null  float64
 8   northing      63721 non-null  float64
 9   gisid         63721 non-null  int64
 10  longitude     63721 non-null  float64
 11  suburb        62991 non-null  object
 12  street_id     63721 non-null  int64
 13  add_comp      1352 non-null   object
dtypes: float64(5), int64(2), object(7)
memory usage: 6.8+ MB
The street address dataset consists of 14 columns and a total of 63,721 rows. It contains location information, including the latitude, longitude, northing, and easting of every property address within the City of Melbourne.
# Business establishments and jobs data by business size and industry
industrysize_jobs.head()
| census_year | clue_small_area | anzsic_indusrty | clue_industry | business_size | total_establishments | total_jobs | |
|---|---|---|---|---|---|---|---|
| 0 | 2015 | West Melbourne (Residential) | Health Care and Social Assistance | Health Care and Social Assistance | Large business | 1 | NaN |
| 1 | 2015 | West Melbourne (Residential) | Manufacturing | Manufacturing | Medium business | 5 | 171.0 |
| 2 | 2015 | West Melbourne (Residential) | Manufacturing | Manufacturing | Non employing | 1 | NaN |
| 3 | 2015 | West Melbourne (Residential) | Professional, Scientific and Technical Services | Business Services | Non employing | 3 | 0.0 |
| 4 | 2015 | West Melbourne (Residential) | Rental, Hiring and Real Estate Services | Real Estate Services | Small business | 5 | 42.0 |
# Business establishments and jobs data by business size and industry
#datatype information
industrysize_jobs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14692 entries, 0 to 14691
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   census_year           14692 non-null  int64
 1   clue_small_area       14692 non-null  object
 2   anzsic_indusrty       14692 non-null  object
 3   clue_industry         14692 non-null  object
 4   business_size         14692 non-null  object
 5   total_establishments  14692 non-null  int64
 6   total_jobs            10365 non-null  float64
dtypes: float64(1), int64(2), object(4)
memory usage: 803.6+ KB
The industrysize_jobs dataset consists of 7 columns and a total of 14,692 rows. It provides information on industries classified according to ANZSIC, the CLUE industry classification, the CLUE small area, and total job creation for each business size.
# Business establishments location and industry classification
industryinfo_classification.head()
| census_year | block_id | property_id | base_property_id | clue_small_area | trading_name | business_address | industry_anzsic4_code | industry_anzsic4_description | longitude | latitude | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 266 | 109851 | 109851 | Carlton | Metropoli's Research Pty Ltd | Level 1, 74 Victoria Street CARLTON 3053 | 6950 | Market Research and Statistical Services | 144.965352 | -37.806701 |
| 1 | 2017 | 266 | 109851 | 109851 | Carlton | J Hong Restaurant | Ground , 74 Victoria Street CARLTON 3053 | 4511 | Cafes and Restaurants | 144.965352 | -37.806701 |
| 2 | 2017 | 266 | 534003 | 534003 | Carlton | St2 Expresso | 70 Victoria Street CARLTON 3053 | 4512 | Takeaway Food Services | 144.965473 | -37.806714 |
| 3 | 2017 | 266 | 664003 | 664003 | Carlton | RMIT Resources Ltd | 20 Cardigan Street CARLTON 3053 | 8102 | Higher Education | 144.964753 | -37.806312 |
| 4 | 2017 | 266 | 664005 | 664005 | Carlton | vacant | 24 Cardigan Street CARLTON 3053 | 0 | Vacant Space | 144.964772 | -37.806203 |
# Business establishments location and industry classification datatype information
industryinfo_classification.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374210 entries, 0 to 374209
Data columns (total 11 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   census_year                   374210 non-null  int64
 1   block_id                      374210 non-null  int64
 2   property_id                   374210 non-null  int64
 3   base_property_id              374210 non-null  int64
 4   clue_small_area               374210 non-null  object
 5   trading_name                  374083 non-null  object
 6   business_address              374209 non-null  object
 7   industry_anzsic4_code         374210 non-null  int64
 8   industry_anzsic4_description  374210 non-null  object
 9   longitude                     369425 non-null  float64
 10  latitude                      369425 non-null  float64
dtypes: float64(2), int64(5), object(4)
memory usage: 31.4+ MB
The industryinfo_classification dataset has 374,210 rows with 11 columns.
Data preprocessing is an essential step to ensure the datasets are ready for further analysis. We begin by identifying and addressing any missing values so that our data is complete and reliable. Next, we detect and remove any duplicate entries that could skew the analysis and lead to inaccurate results. Cleaning the data enhances the quality and integrity of the datasets, making them suitable for detailed exploration, analysis, and model building in subsequent steps.
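The missing-value and duplicate checks are applied to each dataset in turn below; the shared pattern can be captured in a small audit helper. A sketch (`audit` is a name introduced here; a synthetic frame illustrates the output):

```python
import pandas as pd
import numpy as np

def audit(df):
    """Return columns with missing values (count and %) plus the duplicate-row count."""
    nulls = df.isnull().sum()
    report = pd.DataFrame({
        'missing': nulls[nulls > 0],
        'missing_pct': (nulls[nulls > 0] / len(df) * 100).round(1),
    })
    return report, int(df.duplicated().sum())

# Illustration on a synthetic frame; with the real data: audit(street_add), etc.
demo = pd.DataFrame({'a': [1, 1, np.nan], 'b': ['x', 'x', 'y']})
report, dups = audit(demo)
print(report)
print('duplicates:', dups)
```

Reporting the percentage alongside the raw count makes drop-versus-impute decisions (like the one taken for `add_comp` below) easier to justify.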
Preprocessing the street addresses dataset.
# Detecting missing values
missing_values = street_add.isnull().sum()
print(missing_values)
geo_point_2d        0
geo_shape           0
suburb_id           6
latitude            0
street_no         652
str_name            1
address_pnt         0
easting             0
northing            0
gisid               0
longitude           0
suburb            730
street_id           0
add_comp        62369
dtype: int64
In the street address dataset, the 'add_comp' feature has a total of 62,369 missing values out of 63,721 rows (roughly 98% of entries), so we drop the column.
# Removing the 'add_comp' column
street_add1 = street_add.drop(columns=['add_comp'])
street_add1.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 63721 entries, 0 to 63720
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   geo_point_2d  63721 non-null  object
 1   geo_shape     63721 non-null  object
 2   suburb_id     63715 non-null  float64
 3   latitude      63721 non-null  float64
 4   street_no     63069 non-null  object
 5   str_name      63720 non-null  object
 6   address_pnt   63721 non-null  object
 7   easting       63721 non-null  float64
 8   northing      63721 non-null  float64
 9   gisid         63721 non-null  int64
 10  longitude     63721 non-null  float64
 11  suburb        62991 non-null  object
 12  street_id     63721 non-null  int64
dtypes: float64(5), int64(2), object(6)
memory usage: 6.3+ MB
#Dropping rows with missing values in the 'street_add1' dataset.
street_add_cleaned = street_add1.dropna()
# Show the resulting DataFrame
street_add_cleaned1 = street_add_cleaned.isnull().sum()
print(street_add_cleaned1)
geo_point_2d    0
geo_shape       0
suburb_id       0
latitude        0
street_no       0
str_name        0
address_pnt     0
easting         0
northing        0
gisid           0
longitude       0
suburb          0
street_id       0
dtype: int64
# Check for duplicates and show the number of duplicate rows
duplicate_count_before = street_add_cleaned.duplicated().sum()
print(f"Number of duplicate rows before removal: {duplicate_count_before}")
Number of duplicate rows before removal: 0
industrysize_jobs.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 14692 entries, 0 to 14691
Data columns (total 7 columns):
 #   Column                Non-Null Count  Dtype
---  ------                --------------  -----
 0   census_year           14692 non-null  int64
 1   clue_small_area       14692 non-null  object
 2   anzsic_indusrty       14692 non-null  object
 3   clue_industry         14692 non-null  object
 4   business_size         14692 non-null  object
 5   total_establishments  14692 non-null  int64
 6   total_jobs            10365 non-null  float64
dtypes: float64(1), int64(2), object(4)
memory usage: 803.6+ KB
# Detecting missing values in the dataset
missing_values = industrysize_jobs.isnull().sum()
print(missing_values)
census_year                0
clue_small_area            0
anzsic_indusrty            0
clue_industry              0
business_size              0
total_establishments       0
total_jobs              4327
dtype: int64
# Calculating the composition of the dataset based on business sizes
business_size_industryjobs = industrysize_jobs['business_size'].value_counts()
print(business_size_industryjobs)
Small business     5201
Medium business    4392
Non employing      2931
Large business     2168
Name: business_size, dtype: int64
# Calculate the number of missing values in 'total_jobs', grouped by business size
missing_values_jobs_business_size = industrysize_jobs[industrysize_jobs['total_jobs'].isnull()].groupby('business_size')['total_jobs'].size()
#Displaying the different missing values categorised based on different business size
print(missing_values_jobs_business_size)
business_size
Large business     1092
Medium business    1297
Non employing      1211
Small business      727
Name: total_jobs, dtype: int64
# Grouping data by year and business size
grouped_data = industrysize_jobs.groupby(['census_year', 'business_size'])['total_jobs'].sum().unstack()
# Plotting
plt.figure(figsize=(10, 6))
for column in grouped_data.columns:
plt.plot(grouped_data.index, grouped_data[column], marker='o', label=column)
plt.title('Total Number of Jobs Over the Years by Business Size')
plt.xlabel('Year')
plt.ylabel('Total Jobs')
plt.legend(title='Business Size')
plt.grid(True)
plt.show()
# Descriptive statistic by grouping different business size
industrysize_jobs_desc = industrysize_jobs.groupby('business_size')['total_jobs'].describe()
industrysize_jobs_desc
| count | mean | std | min | 25% | 50% | 75% | max | |
|---|---|---|---|---|---|---|---|---|
| business_size | ||||||||
| Large business | 1076.0 | 12633.487918 | 32292.079142 | 687.0 | 2084.5 | 3882.0 | 11539.25 | 273200.0 |
| Medium business | 3095.0 | 2418.815509 | 10420.953933 | 66.0 | 222.5 | 409.0 | 1267.00 | 153302.0 |
| Non employing | 1720.0 | 0.000000 | 0.000000 | 0.0 | 0.0 | 0.0 | 0.00 | 0.0 |
| Small business | 4474.0 | 984.121144 | 5117.223702 | 3.0 | 48.0 | 108.0 | 393.50 | 80008.0 |
# Replacing missing value with min value based on business size
industrysize_jobs['total_jobs'] = industrysize_jobs.groupby('business_size')['total_jobs'].transform(lambda x: x.fillna(x.min()))
# Grouping data by year and business size
grouped_data = industrysize_jobs.groupby(['census_year', 'business_size'])['total_jobs'].sum().unstack()
# Plotting
plt.figure(figsize=(10, 6))
for column in grouped_data.columns:
plt.plot(grouped_data.index, grouped_data[column], marker='o', label=column)
plt.title('Total Number of Jobs Over the Years by Business Size')
plt.xlabel('Year')
plt.ylabel('Total Jobs')
plt.legend(title='Business Size')
plt.grid(True)
plt.show()
The industrysize_jobs dataset contains 4,327 missing entries in the total_jobs column, which indicates the total number of jobs or employees hired across various business sizes. To address these missing values, we have imputed them with the minimum number of employees typical for each respective business size.
Using the minimum number of employees for imputation is practical because it provides a conservative estimate that avoids overestimating the employment impact of smaller businesses.
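How the group-minimum choice compares with a common alternative (the group median) can be seen on a tiny synthetic example; the groupby-transform mechanism below is the same one applied to the real data above, only the values are made up:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    'business_size': ['Small', 'Small', 'Small', 'Large', 'Large'],
    'total_jobs':    [10.0,    np.nan,  30.0,    100.0,   np.nan],
})

# Group minimum: the conservative choice used in this project
min_fill = df.groupby('business_size')['total_jobs'].transform(lambda x: x.fillna(x.min()))
# Group median: a common, less conservative alternative
med_fill = df.groupby('business_size')['total_jobs'].transform(lambda x: x.fillna(x.median()))

print(min_fill.tolist())  # [10.0, 10.0, 30.0, 100.0, 100.0]
print(med_fill.tolist())  # [10.0, 20.0, 30.0, 100.0, 100.0]
```

The missing Small value becomes 10 under the minimum but 20 under the median, illustrating why the minimum systematically understates rather than overstates employment.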
# Check for duplicates and show the number of duplicate rows
duplicate_count_before = industrysize_jobs.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count_before}")
Number of duplicate rows: 0
industryinfo_classification.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374210 entries, 0 to 374209
Data columns (total 11 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   census_year                   374210 non-null  int64
 1   block_id                      374210 non-null  int64
 2   property_id                   374210 non-null  int64
 3   base_property_id              374210 non-null  int64
 4   clue_small_area               374210 non-null  object
 5   trading_name                  374083 non-null  object
 6   business_address              374209 non-null  object
 7   industry_anzsic4_code         374210 non-null  int64
 8   industry_anzsic4_description  374210 non-null  object
 9   longitude                     369425 non-null  float64
 10  latitude                      369425 non-null  float64
dtypes: float64(2), int64(5), object(4)
memory usage: 31.4+ MB
# Detect missing values
missing_values = industryinfo_classification.isnull().sum()
print(missing_values)
census_year                        0
block_id                           0
property_id                        0
base_property_id                   0
clue_small_area                    0
trading_name                     127
business_address                   1
industry_anzsic4_code              0
industry_anzsic4_description       0
longitude                       4785
latitude                        4785
dtype: int64
# Remove all rows which have missing values
industryinfo_classified=industryinfo_classification.dropna()
industryinfo_classified.isnull().sum()
census_year                     0
block_id                        0
property_id                     0
base_property_id                0
clue_small_area                 0
trading_name                    0
business_address                0
industry_anzsic4_code           0
industry_anzsic4_description    0
longitude                       0
latitude                        0
dtype: int64
# Check for duplicates and show the number of duplicate rows
duplicate_count_before = industryinfo_classified.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count_before}")
Number of duplicate rows: 0
We have successfully cleaned the datasets and replaced missing values appropriately.
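A quick sanity check can confirm the cleaning before moving on. This sketch uses a synthetic stand-in; with the real data, `cleaned` would be each of `street_add_cleaned`, `industrysize_jobs`, and `industryinfo_classified` in turn:

```python
import pandas as pd

cleaned = pd.DataFrame({'a': [1, 2, 3], 'b': ['x', 'y', 'z']})

# No remaining missing values and no duplicate rows
assert cleaned.isnull().sum().sum() == 0
assert cleaned.duplicated().sum() == 0
print('cleaning checks passed')
```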
We will conduct an exploratory analysis of business establishments and jobs data, focusing on how businesses are distributed across different industries and sizes, and how these factors influence total employment. By leveraging various visualizations and trend analyses, we aim to uncover key patterns and insights that will provide a deeper understanding of the economic landscape and inform strategic decision-making.
Initially, exploratory data analysis is performed on the "Business establishments and jobs data by business size and industry" dataset to surface the patterns hidden in it.
# Create a pivot table
pivot_table = industrysize_jobs.pivot_table(
values='total_jobs',
index='anzsic_indusrty',
columns='business_size',
aggfunc='sum'
)
# Create a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(pivot_table, annot=True, fmt='.0f', cmap='YlGnBu')
plt.title('Heatmap of Total Jobs by Industry and Business Size')
plt.xlabel('Business Size')
plt.ylabel('Industry')
plt.show()
The heatmap reveals that large businesses are the primary contributors to job creation across most industries, particularly in "Financial and Insurance Services" and "Professional, Scientific and Technical Services." Medium and small businesses also play a vital role, especially in sectors like "Construction" and "Retail Trade." Non-employing businesses, though less impactful in job numbers, are present in all industries, with notable representation in "Retail Trade" and "Construction." This analysis underscores the significant role of large enterprises in employment, while highlighting the diversity and importance of smaller business sizes across various sectors.
# Create a stacked bar chart
plt.figure(figsize=(16, 8))
business_area_industry = industrysize_jobs.groupby(['clue_small_area', 'anzsic_indusrty']).agg({
'total_establishments': 'sum'
}).unstack().fillna(0)
business_area_industry.plot(kind='bar', stacked=True, ax=plt.gca())
plt.title('Total Establishments by Area and Industry')
plt.xlabel('Clue Small Area')
plt.ylabel('Total Establishments')
plt.xticks(rotation=45)
plt.legend(title='Industry', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
The bar chart illustrates the distribution of total business establishments across various Clue Small Areas within the City of Melbourne. The "City of Melbourne (total)" area, unsurprisingly, has the highest concentration of establishments, with a significant dominance in the "All ANZSIC" category, followed by "Melbourne (CBD)" which shows a diverse representation across multiple industries. Other areas like Docklands, South Yarra, and Port Melbourne have a much smaller, yet varied, distribution of establishments across different industries. This visualization highlights the central role of Melbourne's CBD in hosting a large number of businesses, reflecting its importance as a key economic hub in the region.
# Creating a pairplot
# Note: sns.pairplot creates its own figure, so a prior plt.figure() call
# would leave an empty stray figure behind
sns.pairplot(industrysize_jobs, hue='business_size', vars=['total_establishments', 'total_jobs'], height=4)
# Set title (note: title will not appear directly on the pairplot)
plt.suptitle('Pair Plot of Total Establishments and Total Jobs by Business Size', y=1.02, fontsize=16)
plt.show()
The pair plot visualizes the relationships between total establishments and total jobs across different business sizes. It shows that large businesses generally have a higher number of jobs but fewer establishments compared to small and medium-sized businesses. Small businesses, represented by red dots, are more concentrated in both the number of establishments and jobs, indicating a larger presence but fewer employees per establishment. Medium businesses occupy a middle ground, while non-employing businesses (green) are clustered close to the origin, reflecting minimal job creation despite a number of establishments. This plot helps to identify the scale and impact of different business sizes within the dataset.
# Convert the 'census_year' column to datetime format, if it's not already
industrysize_jobs['census_year'] = pd.to_datetime(industrysize_jobs['census_year'], errors='coerce', format='%Y')
# Group data by year and industry, then calculate the sum of total establishments and total jobs
industry_yearly_data = industrysize_jobs.groupby([industrysize_jobs['census_year'].dt.year, 'anzsic_indusrty']).agg({
'total_establishments': 'sum',
'total_jobs': 'sum'
}).reset_index()
# Create a line plot for total establishments over time, by industry
fig_establishments = px.line(
industry_yearly_data,
x='census_year',
y='total_establishments',
color='anzsic_indusrty',
title='Yearly Trends in Total Establishments by Industry',
labels={'census_year': 'Year', 'total_establishments': 'Total Establishments'},
markers=True
)
fig_establishments.update_layout(
xaxis_title='Year',
yaxis_title='Total Establishments',
legend_title_text='Industry',
width=900,
height=600
)
fig_establishments.show()
# Create a line plot for total jobs over time, by industry
fig_jobs = px.line(
industry_yearly_data,
x='census_year',
y='total_jobs',
color='anzsic_indusrty',
title='Yearly Trends in Total Jobs by Industry',
labels={'census_year': 'Year', 'total_jobs': 'Total Jobs'},
markers=True
)
fig_jobs.update_layout(
xaxis_title='Year',
yaxis_title='Total Jobs',
legend_title_text='Industry',
width=900,
height=600
)
fig_jobs.show()
The chart shows the yearly trends in total establishments by industry in the City of Melbourne from 2005 to 2020. It plots the number of establishments for each industry over this period, covering Accommodation and Food Services; Administrative and Support Services; Agriculture, Forestry and Fishing; All ANZSIC; Arts and Recreation Services; Construction; Education and Training; Electricity, Gas, Water and Waste Services; Financial and Insurance Services; Health Care and Social Assistance; Information Media and Telecommunications; Manufacturing; Mining; Other Services; Professional, Scientific and Technical Services; Public Administration and Safety; Rental, Hiring and Real Estate Services; Retail Trade; Transport, Postal and Warehousing; and Wholesale Trade.
The second visualization shows the yearly trends in total jobs by industry as a line chart, displaying the changes in the number of jobs across industries from 2005 to 2020. Each industry is drawn in a different color, making the relative growth or decline in jobs within each industry easy to compare over time. The chart provides a comprehensive overview of the dynamic nature of the job market, highlighting industries that have experienced significant change, such as Accommodation and Food Services, Administrative and Support Services, and Manufacturing. This information is valuable for understanding the economic landscape, identifying emerging industries, and making informed decisions about workforce development and business strategies.
Further analysis is performed on the "Business establishments location and industry classification" dataset to uncover patterns hidden in the data.
A word cloud of the most frequent words in business trading names is displayed below; it is a quick way to spot common naming themes among the city's businesses.
# Word Cloud of Business Names
wordcloud = WordCloud(width=1600, height=800, background_color='white').generate(' '.join(industryinfo_classified['trading_name']))
plt.figure(figsize=(20,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.title('Word Cloud of Business Names')
plt.show()
A correlation heatmap of the numeric variables is presented next. The heatmap displays the correlation coefficients between variables such as census_year, block_id, property_id, base_property_id, industry_anzsic4_code, longitude, latitude, and others. With the 'coolwarm' color scale, values range from -1 (strong negative correlation, shown in dark blue) to 1 (strong positive correlation, shown in dark red), with white indicating no correlation.
Some key observations from the heatmap:
Overall, the heatmap provides a visual representation of the relationships between the different numeric variables in the dataset.
# Correlation Heatmap
numeric_df = industryinfo_classified.select_dtypes(include=[np.number])
plt.figure(figsize=(12, 10))
sns.heatmap(numeric_df.corr(), annot=True, cmap='coolwarm', linewidths=0.5)
plt.title('Correlation Heatmap of Numeric Variables')
plt.show()
A business density heatmap by CLUE (Census of Land Use and Employment) Small Area and census year for locations in the City of Melbourne is presented here. The heatmap displays the number of business establishments in each CLUE Small Area over time, from 2002 to 2022.
The key observations from the heatmap are:
This information can be useful for understanding the spatial distribution and trends of business activities across the City of Melbourne.
# Business Density Heatmap by CLUE Small Area
pivot = industryinfo_classified.pivot_table(
index='clue_small_area',
columns='census_year',
values='property_id',
aggfunc='count'
)
plt.figure(figsize=(15, 10))
sns.heatmap(pivot, cmap='YlOrRd', annot=True, fmt='g')
plt.title('Business Density Heatmap by CLUE Small Area and Census Year')
plt.xlabel('Census Year')
plt.ylabel('CLUE Small Area')
plt.show()
The industry composition in the top 10 CLUE (Census of Land Use and Employment) Small Areas in Melbourne, Australia, is shown here. The vertical axis represents the total number of businesses, and the horizontal axis lists the different CLUE Small Areas.
The key observations from the chart are:
This information provides insights into the industry composition and business landscape within the top 10 CLUE Small Areas in Melbourne, which can be useful for understanding the economic activity and trends in the city.
# Industry Composition by CLUE Small Area
top_areas = industryinfo_classified['clue_small_area'].value_counts().nlargest(10).index
top_industries = industryinfo_classified['industry_anzsic4_description'].value_counts().nlargest(5).index
industry_area = pd.crosstab(
industryinfo_classified['clue_small_area'],
industryinfo_classified['industry_anzsic4_description']
)
industry_area = industry_area.loc[top_areas, top_industries]
industry_area.plot(kind='bar', stacked=True, figsize=(15, 10))
plt.title('Industry Composition in Top 10 CLUE Small Areas')
plt.xlabel('CLUE Small Area')
plt.ylabel('Number of Businesses')
plt.legend(title='Industry', bbox_to_anchor=(1.05, 1), loc='upper left')
plt.tight_layout()
plt.show()
The graph shows the geographical distribution of businesses by industry in a specific area. Each dot on the map represents a business, color-coded by industry type. The map reveals the clustering and dispersion of different types of businesses across the geographic region. For example, there appears to be a concentration of businesses in certain industries in certain areas, while other industries are more spread out. This information could be useful for understanding the economic landscape and patterns of commercial activity in the region.
# Create a scatter plot with latitude, longitude, and industry
# Size each point by the number of businesses recorded in its block
# (block_id itself is just an identifier, not a count)
industryinfo_classified['businesses_in_block'] = industryinfo_classified.groupby('block_id')['property_id'].transform('count')
fig = px.scatter(
    industryinfo_classified,
    x='longitude',
    y='latitude',
    color='industry_anzsic4_description',  # Color by industry description
    size='businesses_in_block',            # Size by number of businesses in each block
    hover_name='trading_name',             # Show trading name on hover
    title='Geographical Distribution of Businesses by Industry',
    labels={'longitude': 'Longitude', 'latitude': 'Latitude'}
)
fig.show()
A treemap visualization of businesses by area and industry in Melbourne, Australia is shown here. The different areas of Melbourne (CBD, Carlton, Docklands, East Melbourne, West Melbourne, Port Melbourne, Melbourne Remainder, Southbank, North Melbourne, Kensington, Parkville, and South Yarra) are represented as the larger rectangles, and the industries within each area are shown as smaller nested rectangles.
The size of each rectangle corresponds to the relative number of businesses in that industry and area. The industries shown include Vacant Space, Legal Services, Takeaway Food Services, Cafes and Restaurants, and others.
Some key observations:
This treemap visualization provides a comprehensive overview of the spatial distribution and industry composition of businesses across the City of Melbourne, which can be useful for understanding the economic landscape and identifying potential opportunities or trends.
import plotly.express as px
# Prepare the data for treemap
treemap_data = industryinfo_classified.groupby(['clue_small_area', 'industry_anzsic4_description']).size().reset_index(name='counts')
# Create the treemap
fig = px.treemap(treemap_data,
path=['clue_small_area', 'industry_anzsic4_description'],
values='counts',
title='Treemap of Businesses by Area and Industry')
fig.show()
The distribution of easting (the east-west geographic coordinate) across different suburbs in the Melbourne area is shown here. Each box summarises the spread of easting values for the addresses in one suburb.
Some key observations:
This visualization is useful for understanding the geographic positioning of the different suburbs in relation to the east-west axis, which may be relevant for urban planning, transportation, or other spatial analyses involving the Melbourne metropolitan area.
# Box plot of easting across suburbs
plt.figure(figsize=(14, 6))
sns.boxplot(data=street_add_cleaned, x='suburb', y='easting')
plt.xticks(rotation=90)
plt.title('Distribution of Easting Across Suburbs')
plt.xlabel('Suburb')
plt.ylabel('Easting')
plt.show()
# Box plot of northing across suburbs
plt.figure(figsize=(14, 6))
sns.boxplot(data=street_add_cleaned, x='suburb', y='northing')
plt.xticks(rotation=90)
plt.title('Distribution of Northing Across Suburbs')
plt.xlabel('Suburb')
plt.ylabel('Northing')
plt.show()
For data modeling, we plan to implement both classification and regression analysis to derive actionable insights from our dataset. The classification model will predict the business size (business_size) as our target class, utilizing features such as census_year, longitude, latitude, clue_small_area, industry_anzsic4_code, and total_establishments. This analysis will help in understanding the factors that determine whether a business is classified as small, medium, or large.
Simultaneously, the regression model will focus on predicting the total number of jobs (total_jobs) as the target variable, leveraging the same set of features. This regression analysis will allow us to forecast employment trends based on business characteristics and location data. Through careful selection of features and model tuning, we aim to uncover patterns that can inform economic and urban planning decisions, with the classification model providing insights into business categorization and the regression model offering projections on job creation.
Now we merge the data for the modeling phase.
street_add_cleaned.info()
industrysize_jobs.info()
industryinfo_classified.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 62338 entries, 0 to 63720
Data columns (total 13 columns):
 #   Column        Non-Null Count  Dtype
---  ------        --------------  -----
 0   geo_point_2d  62338 non-null  object
 1   geo_shape     62338 non-null  object
 2   suburb_id     62338 non-null  float64
 3   latitude      62338 non-null  float64
 4   street_no     62338 non-null  object
 5   str_name      62338 non-null  object
 6   address_pnt   62338 non-null  object
 7   easting       62338 non-null  float64
 8   northing      62338 non-null  float64
 9   gisid         62338 non-null  int64
 10  longitude     62338 non-null  float64
 11  suburb        62338 non-null  object
 12  street_id     62338 non-null  int64
dtypes: float64(5), int64(2), object(6)
memory usage: 6.7+ MB

&lt;class 'pandas.core.frame.DataFrame'&gt;
RangeIndex: 14692 entries, 0 to 14691
Data columns (total 7 columns):
 #  Column                Non-Null Count  Dtype
--  ------                --------------  -----
 0  census_year           14692 non-null  datetime64[ns]
 1  clue_small_area       14692 non-null  object
 2  anzsic_indusrty       14692 non-null  object
 3  clue_industry         14692 non-null  object
 4  business_size         14692 non-null  object
 5  total_establishments  14692 non-null  int64
 6  total_jobs            14692 non-null  float64
dtypes: datetime64[ns](1), float64(1), int64(1), object(4)
memory usage: 803.6+ KB

&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 369329 entries, 0 to 374209
Data columns (total 11 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   census_year                   369329 non-null  int64
 1   block_id                      369329 non-null  int64
 2   property_id                   369329 non-null  int64
 3   base_property_id              369329 non-null  int64
 4   clue_small_area               369329 non-null  object
 5   trading_name                  369329 non-null  object
 6   business_address              369329 non-null  object
 7   industry_anzsic4_code         369329 non-null  int64
 8   industry_anzsic4_description  369329 non-null  object
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
dtypes: float64(2), int64(5), object(4)
memory usage: 33.8+ MB
Merging the datasets
# Collapse industrysize_jobs to one row per clue_small_area (keeping the first record)
# so the merge below does not multiply rows. Note: the merged industry/size columns
# will then describe that single area-level record, not each individual business.
industrysize_jobs_unique = industrysize_jobs.groupby('clue_small_area').first().reset_index()
# Merge the DataFrames
merged_df5 = pd.merge(industryinfo_classified,
industrysize_jobs_unique[['clue_small_area', 'anzsic_indusrty', 'clue_industry', 'business_size', 'total_establishments', 'total_jobs']],
on='clue_small_area',
how='left')
# Check the result
print(f"Number of rows in merged DataFrame: {len(merged_df5)}")
print(merged_df5.info())
Number of rows in merged DataFrame: 369329
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 369329 entries, 0 to 369328
Data columns (total 16 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   census_year                   369329 non-null  int64
 1   block_id                      369329 non-null  int64
 2   property_id                   369329 non-null  int64
 3   base_property_id              369329 non-null  int64
 4   clue_small_area               369329 non-null  object
 5   trading_name                  369329 non-null  object
 6   business_address              369329 non-null  object
 7   industry_anzsic4_code         369329 non-null  int64
 8   industry_anzsic4_description  369329 non-null  object
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object
 12  clue_industry                 369329 non-null  object
 13  business_size                 369329 non-null  object
 14  total_establishments          369329 non-null  int64
 15  total_jobs                    369329 non-null  float64
dtypes: float64(3), int64(6), object(7)
memory usage: 47.9+ MB
None
# Detect missing values
missing_values = merged_df5.isnull().sum()
print(missing_values)
# Check for duplicates and show the number of duplicate rows
duplicate_count_before = merged_df5.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count_before}")
census_year                     0
block_id                        0
property_id                     0
base_property_id                0
clue_small_area                 0
trading_name                    0
business_address                0
industry_anzsic4_code           0
industry_anzsic4_description    0
longitude                       0
latitude                        0
anzsic_indusrty                 0
clue_industry                   0
business_size                   0
total_establishments            0
total_jobs                      0
dtype: int64
Number of duplicate rows: 0
# Merge the street address data onto the merged business data on latitude and longitude
merged_df2 = pd.merge(merged_df5, street_add_cleaned,
on=['latitude', 'longitude'],
how='left')
merged_df2.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 369329 entries, 0 to 369328
Data columns (total 27 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   census_year                   369329 non-null  int64
 1   block_id                      369329 non-null  int64
 2   property_id                   369329 non-null  int64
 3   base_property_id              369329 non-null  int64
 4   clue_small_area               369329 non-null  object
 5   trading_name                  369329 non-null  object
 6   business_address              369329 non-null  object
 7   industry_anzsic4_code         369329 non-null  int64
 8   industry_anzsic4_description  369329 non-null  object
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object
 12  clue_industry                 369329 non-null  object
 13  business_size                 369329 non-null  object
 14  total_establishments          369329 non-null  int64
 15  total_jobs                    369329 non-null  float64
 16  geo_point_2d                  0 non-null       object
 17  geo_shape                     0 non-null       object
 18  suburb_id                     0 non-null       float64
 19  street_no                     0 non-null       object
 20  str_name                      0 non-null       object
 21  address_pnt                   0 non-null       object
 22  easting                       0 non-null       float64
 23  northing                      0 non-null       float64
 24  gisid                         0 non-null       float64
 25  suburb                        0 non-null       object
 26  street_id                     0 non-null       float64
dtypes: float64(8), int64(6), object(13)
memory usage: 78.9+ MB
# Detect missing values
missing_values = merged_df2.isnull().sum()
print(missing_values)
# Check for duplicates and show the number of duplicate rows
duplicate_count_before = merged_df2.duplicated().sum()
print(f"Number of duplicate rows: {duplicate_count_before}")
As the merged output shows, the coordinate join found no matches: every street column is entirely null, because floating-point latitude/longitude values from two independent sources rarely agree exactly. The empty street columns are therefore removed.
# List of columns to remove
columns_to_remove = [
'geo_point_2d',
'geo_shape',
'suburb_id',
'street_no',
'str_name',
'address_pnt',
'easting',
'northing',
'gisid',
'suburb',
'street_id'
]
# Remove the specified columns
cleaned_df = merged_df2.drop(columns=columns_to_remove, errors='ignore')
print(cleaned_df.info())
census_year                          0
block_id                             0
property_id                          0
base_property_id                     0
clue_small_area                      0
trading_name                         0
business_address                     0
industry_anzsic4_code                0
industry_anzsic4_description         0
longitude                            0
latitude                             0
anzsic_indusrty                      0
clue_industry                        0
business_size                        0
total_establishments                 0
total_jobs                           0
geo_point_2d                    369329
geo_shape                       369329
suburb_id                       369329
street_no                       369329
str_name                        369329
address_pnt                     369329
easting                         369329
northing                        369329
gisid                           369329
suburb                          369329
street_id                       369329
dtype: int64
Number of duplicate rows: 0
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 369329 entries, 0 to 369328
Data columns (total 16 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   census_year                   369329 non-null  int64
 1   block_id                      369329 non-null  int64
 2   property_id                   369329 non-null  int64
 3   base_property_id              369329 non-null  int64
 4   clue_small_area               369329 non-null  object
 5   trading_name                  369329 non-null  object
 6   business_address              369329 non-null  object
 7   industry_anzsic4_code         369329 non-null  int64
 8   industry_anzsic4_description  369329 non-null  object
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object
 12  clue_industry                 369329 non-null  object
 13  business_size                 369329 non-null  object
 14  total_establishments          369329 non-null  int64
 15  total_jobs                    369329 non-null  float64
dtypes: float64(3), int64(6), object(7)
memory usage: 47.9+ MB
None
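If the street attributes (suburb, easting, northing, etc.) were actually needed, one common workaround is to merge on coordinates rounded to a few decimal places (about one metre at five decimals) rather than on exact floats. A minimal sketch on hypothetical toy frames (the frame and column names mirror the real data, but all values are invented):

```python
import pandas as pd

# Toy frames standing in for merged_df5 and street_add_cleaned; all values invented.
businesses = pd.DataFrame({
    'trading_name': ['Cafe A', 'Shop B'],
    'latitude':  [-37.806701, -37.812341],
    'longitude': [144.965352, 144.954321],
})
streets = pd.DataFrame({
    'suburb': ['Carlton', 'Melbourne'],
    'latitude':  [-37.8067012, -37.8123413],   # differ past the 5th decimal
    'longitude': [144.9653518, 144.9543207],
})

# Round both sides to 5 decimal places so near-identical points line up
for df in (businesses, streets):
    df['lat_key'] = df['latitude'].round(5)
    df['lon_key'] = df['longitude'].round(5)

joined = businesses.merge(
    streets[['lat_key', 'lon_key', 'suburb']],
    on=['lat_key', 'lon_key'],
    how='left',
)
print(joined[['trading_name', 'suburb']])
```

For production work a proper spatial join (e.g. nearest-neighbour matching within a distance threshold) would be more robust than rounding.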
cleaned_df.head()
|   | census_year | block_id | property_id | base_property_id | clue_small_area | trading_name | business_address | industry_anzsic4_code | industry_anzsic4_description | longitude | latitude | anzsic_indusrty | clue_industry | business_size | total_establishments | total_jobs |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2017 | 266 | 109851 | 109851 | Carlton | Metropoli's Research Pty Ltd | Level 1, 74 Victoria Street CARLTON 3053 | 6950 | Market Research and Statistical Services | 144.965352 | -37.806701 | Administrative and Support Services | Admin and Support Services | Medium business | 7 | 209.0 |
| 1 | 2017 | 266 | 109851 | 109851 | Carlton | J Hong Restaurant | Ground , 74 Victoria Street CARLTON 3053 | 4511 | Cafes and Restaurants | 144.965352 | -37.806701 | Administrative and Support Services | Admin and Support Services | Medium business | 7 | 209.0 |
| 2 | 2017 | 266 | 534003 | 534003 | Carlton | St2 Expresso | 70 Victoria Street CARLTON 3053 | 4512 | Takeaway Food Services | 144.965473 | -37.806714 | Administrative and Support Services | Admin and Support Services | Medium business | 7 | 209.0 |
| 3 | 2017 | 266 | 664003 | 664003 | Carlton | RMIT Resources Ltd | 20 Cardigan Street CARLTON 3053 | 8102 | Higher Education | 144.964753 | -37.806312 | Administrative and Support Services | Admin and Support Services | Medium business | 7 | 209.0 |
| 4 | 2017 | 266 | 664005 | 664005 | Carlton | vacant | 24 Cardigan Street CARLTON 3053 | 0 | Vacant Space | 144.964772 | -37.806203 | Administrative and Support Services | Admin and Support Services | Medium business | 7 | 209.0 |
The target of the classification model is the business size (business_size), i.e., the target class, predicted from a robust set of features.
cleaned_df.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 369329 entries, 0 to 369328
Data columns (total 16 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   census_year                   369329 non-null  int64
 1   block_id                      369329 non-null  int64
 2   property_id                   369329 non-null  int64
 3   base_property_id              369329 non-null  int64
 4   clue_small_area               369329 non-null  object
 5   trading_name                  369329 non-null  object
 6   business_address              369329 non-null  object
 7   industry_anzsic4_code         369329 non-null  int64
 8   industry_anzsic4_description  369329 non-null  object
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object
 12  clue_industry                 369329 non-null  object
 13  business_size                 369329 non-null  object
 14  total_establishments          369329 non-null  int64
 15  total_jobs                    369329 non-null  float64
dtypes: float64(3), int64(6), object(7)
memory usage: 47.9+ MB
cleaned_df['business_size'].value_counts() # target class
Large business     254033
Medium business     83769
Non employing       19451
Small business      12076
Name: business_size, dtype: int64
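The counts above are heavily skewed toward large businesses. As a quick illustration of how scikit-learn's 'balanced' class weighting (used for the logistic regression later on) compensates for this, the sketch below applies compute_class_weight to a hypothetical label vector with the same proportions (counts scaled down by roughly 1000):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical labels mirroring the skew above (counts scaled down ~1000x):
# 3 = Large, 2 = Medium, 0 = Non employing, 1 = Small
y = np.repeat([3, 2, 0, 1], [254, 84, 19, 12])

classes = np.array([0, 1, 2, 3])
weights = compute_class_weight('balanced', classes=classes, y=y)

# Rare classes get the largest weights: n_samples / (n_classes * count)
print(dict(zip(classes.tolist(), weights.round(3).tolist())))
```

The rarest class (small business) receives the largest weight, so misclassifying it costs the model proportionally more during training.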
Mapping categorical variables to numerical variables
# Mapping business_size to numeric values
business_size_mapping = {
'Large business': 3,
'Medium business': 2,
'Small business': 1,
'Non employing': 0
}
# Apply the mapping to the 'business_size' column
cleaned_df['business_size_numeric'] = cleaned_df['business_size'].map(business_size_mapping)
# Check the conversion
print(cleaned_df[['business_size', 'business_size_numeric']].head())
     business_size  business_size_numeric
0  Medium business                      2
1  Medium business                      2
2  Medium business                      2
3  Medium business                      2
4  Medium business                      2
# Mapping clue_small_area to numeric values
clue_small_area_mapping = {
'Melbourne (CBD)': 0,
'Carlton': 1,
'North Melbourne': 2,
'Docklands': 3,
'Southbank': 4,
'East Melbourne': 5,
'West Melbourne (Residential)': 6,
'Port Melbourne': 7,
'Kensington': 8,
'Melbourne (Remainder)': 9,
'West Melbourne (Industrial)': 10,
'Parkville': 11,
'South Yarra': 12
}
# Apply the mapping to the 'clue_small_area' column
cleaned_df['clue_small_area_numeric'] = cleaned_df['clue_small_area'].map(clue_small_area_mapping)
# Check the conversion
print(cleaned_df[['clue_small_area', 'clue_small_area_numeric']].head())
  clue_small_area  clue_small_area_numeric
0         Carlton                        1
1         Carlton                        1
2         Carlton                        1
3         Carlton                        1
4         Carlton                        1
# Mapping clue_industry to numeric values
clue_industry_mapping = {
'Accommodation': 0,
'Food and Beverage Services': 1,
'Admin and Support Services': 2,
'Health Care and Social Assistance': 3
}
# Apply the mapping to the 'clue_industry' column
cleaned_df['clue_industry_numeric'] = cleaned_df['clue_industry'].map(clue_industry_mapping)
# Check the conversion
print(cleaned_df[['clue_industry', 'clue_industry_numeric']].head())
                clue_industry  clue_industry_numeric
0  Admin and Support Services                      2
1  Admin and Support Services                      2
2  Admin and Support Services                      2
3  Admin and Support Services                      2
4  Admin and Support Services                      2
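A caveat on these hand-written mappings: Series.map leaves any category missing from the dict as NaN, so the dicts must be kept in sync with the data. A lower-maintenance alternative (a sketch on toy values, not the project's data) is to let pandas assign integer codes to every distinct category:

```python
import pandas as pd

# Toy values (not the project's data): one category missing from the dict
s = pd.Series(['Accommodation', 'Retail Trade', 'Accommodation'])

mapping = {'Accommodation': 0}
mapped = s.map(mapping)          # unlisted categories silently become NaN
print(mapped.isna().tolist())    # [False, True, False]

# Categorical codes cover every distinct value automatically
codes = s.astype('category').cat.codes
print(codes.tolist())            # [0, 1, 0]
```

The trade-off is that categorical codes are assigned alphabetically rather than in a meaning-bearing order, so an explicit dict is still preferable for an ordinal target like business_size.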
cleaned_df.info()
&lt;class 'pandas.core.frame.DataFrame'&gt;
Int64Index: 369329 entries, 0 to 369328
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   census_year                   369329 non-null  int64
 1   block_id                      369329 non-null  int64
 2   property_id                   369329 non-null  int64
 3   base_property_id              369329 non-null  int64
 4   clue_small_area               369329 non-null  object
 5   trading_name                  369329 non-null  object
 6   business_address              369329 non-null  object
 7   industry_anzsic4_code         369329 non-null  int64
 8   industry_anzsic4_description  369329 non-null  object
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object
 12  clue_industry                 369329 non-null  object
 13  business_size                 369329 non-null  object
 14  total_establishments          369329 non-null  int64
 15  total_jobs                    369329 non-null  float64
 16  business_size_numeric         369329 non-null  int64
 17  clue_small_area_numeric       369329 non-null  int64
 18  clue_industry_numeric         369329 non-null  int64
dtypes: float64(3), int64(9), object(7)
memory usage: 56.4+ MB
# correlation plot
# Define the features and target variable
features = ['census_year', 'longitude', 'latitude', 'clue_small_area_numeric',
'industry_anzsic4_code', 'total_establishments', 'clue_industry_numeric']
target = 'business_size_numeric'
# Create a new DataFrame with the selected features and target variable
correlation_df = cleaned_df[features + [target]]
# Calculate the correlation matrix
correlation_matrix = correlation_df.corr()
# Optionally, visualize the correlation matrix with a heatmap
import seaborn as sns
import matplotlib.pyplot as plt
plt.figure(figsize=(10, 6))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap='coolwarm', square=True)
plt.title("Correlation Matrix")
plt.show()
# Final features and target variable (reduced after inspecting the correlation matrix)
features = ['longitude', 'latitude',
'industry_anzsic4_code', 'total_establishments']
target = 'business_size_numeric'
# Create a StandardScaler instance
scaler = StandardScaler()
# Fit the scaler to the features and transform them
normalized_features = scaler.fit_transform(cleaned_df[features])
# Create a new DataFrame with normalized features
normalized_df = pd.DataFrame(normalized_features, columns=features)
# Display the first few rows of the normalized DataFrame
print(normalized_df.head())
   longitude  latitude  industry_anzsic4_code  total_establishments
0   0.466506  0.681440               0.620677               2.05035
1   0.466506  0.681440              -0.218055               2.05035
2   0.475255  0.679954              -0.217711               2.05035
3   0.422856  0.724366               1.016831               2.05035
4   0.424256  0.736360              -1.769314               2.05035
# Split the data into training and testing sets 70%, 30%
X_train, X_test, y_train, y_test = train_test_split(normalized_df[features], cleaned_df[target], test_size=0.3, random_state=42)
# Create and train the Random Forest classifier
rf_classifier = RandomForestClassifier(random_state=42)
rf_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred = rf_classifier.predict(X_test)
# Compute the classification report
class_summary = classification_report(y_test, y_pred, target_names=['Non employing', 'Small business', 'Medium business', 'Large business'],digits=4)
# Compute the confusion matrix
confusion_mat = confusion_matrix(y_test, y_pred)
# Display the results
print("Classification Report:\n", class_summary)
print("Confusion Matrix:\n", confusion_mat)
# Optionally, visualize the confusion matrix with a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat, annot=True, fmt='d', cmap='Blues', xticklabels=['Non employing', 'Small business', 'Medium business', 'Large business'], yticklabels=['Non employing', 'Small business', 'Medium business', 'Large business'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix')
plt.show()
Classification Report:
precision recall f1-score support
Non employing 1.0000 1.0000 1.0000 5879
Small business 1.0000 1.0000 1.0000 3627
Medium business 1.0000 1.0000 1.0000 24968
Large business 1.0000 1.0000 1.0000 76325
accuracy 1.0000 110799
macro avg 1.0000 1.0000 1.0000 110799
weighted avg 1.0000 1.0000 1.0000 110799
Confusion Matrix:
[[ 5879 0 0 0]
[ 0 3627 0 0]
[ 0 0 24968 0]
[ 0 0 0 76325]]
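Perfect precision and recall across all four classes is a strong hint of target leakage rather than genuine predictive power: after the area-level merge, total_establishments and business_size both come from the same single row per CLUE small area, so the feature can identify the label almost exactly. A quick diagnostic is to inspect the model's feature_importances_; the sketch below reproduces the effect on synthetic data where the target is a direct copy of one feature:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(42)
n = 2000

# Synthetic stand-ins: three noise features plus one "leaky" feature that,
# like the area-level merge above, is effectively a copy of the target.
noise = rng.normal(size=(n, 3))
leaky = rng.integers(0, 4, size=n)
X = np.column_stack([noise, leaky])
y = leaky

clf = RandomForestClassifier(random_state=42).fit(X, y)

# The leaky column dominates the importances, flagging the shortcut.
print(clf.feature_importances_.round(3))
```

If the real model shows a similar pattern, dropping the area-level features (or merging the target at a finer grain) would give a more honest estimate of predictive performance.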
# Create and train the Logistic Regression classifier
log_reg_classifier = LogisticRegression(
C=0.5, # Adjust regularization strength
penalty='l2', # Use L2 regularization
solver='lbfgs', # Use 'lbfgs' solver
max_iter=1000, # Keep the maximum iterations
class_weight='balanced', # Adjust weights for imbalanced classes
tol=1e-4, # Set a tolerance for convergence
multi_class='multinomial' # Use softmax for multi-class classification
)
log_reg_classifier.fit(X_train, y_train)
# Make predictions on the test set
y_pred_log_reg = log_reg_classifier.predict(X_test)
# Compute the classification report
class_summary_log_reg = classification_report(y_test, y_pred_log_reg,
target_names=['Non employing', 'Small business', 'Medium business', 'Large business'],
digits=4)
# Compute the confusion matrix
confusion_mat_log_reg = confusion_matrix(y_test, y_pred_log_reg)
# Display the results
print("Classification Report for Logistic Regression:\n", class_summary_log_reg)
print("Confusion Matrix for Logistic Regression:\n", confusion_mat_log_reg)
# Optionally, visualize the confusion matrix with a heatmap
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_mat_log_reg, annot=True, fmt='d', cmap='Blues',
xticklabels=['Non employing', 'Small business', 'Medium business', 'Large business'],
yticklabels=['Non employing', 'Small business', 'Medium business', 'Large business'])
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Confusion Matrix for Logistic Regression')
plt.show()
Classification Report for Logistic Regression:
precision recall f1-score support
Non employing 0.8134 1.0000 0.8971 5879
Small business 0.6284 0.9810 0.7661 3627
Medium business 0.7441 0.8345 0.7867 24968
Large business 0.9695 0.8880 0.9270 76325
accuracy 0.8849 110799
macro avg 0.7889 0.9259 0.8442 110799
weighted avg 0.8993 0.8849 0.8885 110799
Confusion Matrix for Logistic Regression:
[[ 5879 0 0 0]
[ 0 3558 8 61]
[ 0 2062 20837 2069]
[ 1349 42 7157 67777]]
# Build a simple feed-forward classifier (imports shown for completeness)
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense

num_classes = len(y_train.unique())
model = Sequential([
    Dense(128, activation='relu', input_shape=(4,)),
    Dense(64, activation='relu'),
    Dense(num_classes, activation='softmax')
])
model.compile(optimizer='adam',
loss = 'sparse_categorical_crossentropy',
metrics =['accuracy'])
history = model.fit(X_train,y_train,epochs =20,batch_size=32,validation_split=0.2)
test_loss,test_acc = model.evaluate(X_test,y_test)
print("Test Accuracy ",test_acc)
Epoch 1/20
6464/6464 [==============================] - 5s 760us/step - loss: 0.0208 - accuracy: 0.9948 - val_loss: 0.0030 - val_accuracy: 0.9983
Epoch 2/20
6464/6464 [==============================] - 5s 714us/step - loss: 0.0026 - accuracy: 0.9991 - val_loss: 0.0013 - val_accuracy: 0.9996
Epoch 3/20
6464/6464 [==============================] - 5s 716us/step - loss: 0.0016 - accuracy: 0.9995 - val_loss: 5.9158e-04 - val_accuracy: 0.9999
Epoch 4/20
6464/6464 [==============================] - 5s 739us/step - loss: 0.0011 - accuracy: 0.9997 - val_loss: 4.7069e-04 - val_accuracy: 0.9999
Epoch 5/20
6464/6464 [==============================] - 5s 757us/step - loss: 6.7552e-04 - accuracy: 0.9998 - val_loss: 2.7480e-04 - val_accuracy: 0.9999
Epoch 6/20
6464/6464 [==============================] - 5s 755us/step - loss: 4.0970e-04 - accuracy: 0.9999 - val_loss: 2.5435e-04 - val_accuracy: 0.9999
Epoch 7/20
6464/6464 [==============================] - 5s 747us/step - loss: 5.3689e-04 - accuracy: 0.9999 - val_loss: 3.5930e-04 - val_accuracy: 0.9999
Epoch 8/20
6464/6464 [==============================] - 5s 767us/step - loss: 3.1196e-04 - accuracy: 0.9999 - val_loss: 7.4120e-04 - val_accuracy: 0.9996
Epoch 9/20
6464/6464 [==============================] - 5s 819us/step - loss: 4.6061e-04 - accuracy: 0.9999 - val_loss: 6.6086e-04 - val_accuracy: 0.9998
Epoch 10/20
6464/6464 [==============================] - 5s 772us/step - loss: 1.8890e-04 - accuracy: 0.9999 - val_loss: 1.1298e-05 - val_accuracy: 1.0000
Epoch 11/20
6464/6464 [==============================] - 5s 732us/step - loss: 4.6118e-04 - accuracy: 0.9999 - val_loss: 2.5336e-05 - val_accuracy: 1.0000
Epoch 12/20
6464/6464 [==============================] - 5s 721us/step - loss: 3.1193e-04 - accuracy: 0.9999 - val_loss: 1.1105e-04 - val_accuracy: 1.0000
Epoch 13/20
6464/6464 [==============================] - 5s 728us/step - loss: 3.8306e-04 - accuracy: 0.9999 - val_loss: 2.2264e-04 - val_accuracy: 0.9999
Epoch 14/20
6464/6464 [==============================] - 5s 734us/step - loss: 1.0564e-04 - accuracy: 1.0000 - val_loss: 1.6396e-05 - val_accuracy: 1.0000
Epoch 15/20
6464/6464 [==============================] - 5s 726us/step - loss: 3.6668e-04 - accuracy: 0.9999 - val_loss: 3.0403e-04 - val_accuracy: 0.9999
Epoch 16/20
6464/6464 [==============================] - 5s 728us/step - loss: 2.3317e-04 - accuracy: 0.9999 - val_loss: 4.3953e-05 - val_accuracy: 1.0000
Epoch 17/20
6464/6464 [==============================] - 5s 729us/step - loss: 1.5432e-04 - accuracy: 0.9999 - val_loss: 1.2579e-05 - val_accuracy: 1.0000
Epoch 18/20
6464/6464 [==============================] - 5s 745us/step - loss: 2.7530e-04 - accuracy: 0.9999 - val_loss: 0.0039 - val_accuracy: 0.9993
Epoch 19/20
6464/6464 [==============================] - 5s 727us/step - loss: 5.3759e-04 - accuracy: 0.9999 - val_loss: 1.2479e-05 - val_accuracy: 1.0000
Epoch 20/20
6464/6464 [==============================] - 5s 720us/step - loss: 1.2971e-04 - accuracy: 1.0000 - val_loss: 5.3994e-05 - val_accuracy: 0.9999
3463/3463 [==============================] - 2s 505us/step - loss: 2.5325e-05 - accuracy: 1.0000
Test Accuracy 0.9999819397926331
predictions = model.predict(X_test)
predicted_classes = predictions.argmax(axis=1)
history_dict = history.history
loss_values = history_dict['loss']
val_loss_values = history_dict['val_loss']
epochs = range(1, len(loss_values) + 1)
plt.plot(epochs, loss_values, 'bo', label='Training loss')
plt.plot(epochs, val_loss_values, 'b', label='Validation loss')
plt.title('Training and validation loss')
plt.xlabel('Epochs')
plt.ylabel('Loss')
plt.legend()
plt.show()
In this phase of the analysis, we apply regression techniques to model the relationship between the features and a continuous target variable. Regression analysis is valuable for understanding how the independent variables influence the dependent variable and for making predictions based on that understanding. We use Linear Regression to assess the influence of the features on the target variable, here the total number of jobs (total_jobs). The model is evaluated with the R² score, Mean Absolute Error (MAE), and Mean Squared Error (MSE) to quantify its performance. Through this analysis, we aim to uncover insights into the dynamics of business establishments and employment within the dataset.
# Define the target variable (e.g., total_jobs) and features
target = 'total_jobs'
features = ['census_year', 'longitude', 'latitude', 'clue_small_area_numeric',
'industry_anzsic4_code', 'total_establishments', 'clue_industry_numeric']
# Split the data into training and testing sets
X = cleaned_df[features]
y = cleaned_df[target]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Create and train the Linear Regression model
linear_regressor = LinearRegression()
linear_regressor.fit(X_train, y_train)
# Make predictions on the test set
y_pred = linear_regressor.predict(X_test)
# Evaluate the model
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
# Display the evaluation metrics
print("Mean Squared Error (MSE):", mse)
print("Mean Absolute Error (MAE):", mae)
print("R² Score:", r2)
# Optionally, visualize the predictions vs. actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, alpha=0.7)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Actual vs. Predicted Values')
plt.show()
Mean Squared Error (MSE): 130923.17767442112 Mean Absolute Error (MAE): 262.99255646703307 R² Score: 0.6645396958258626
# Create and train the Random Forest Regressor
rf_regressor = RandomForestRegressor(n_estimators=100, random_state=42) # You can adjust n_estimators
rf_regressor.fit(X_train, y_train)
# Make predictions on the test set
y_pred_rf = rf_regressor.predict(X_test)
# Evaluate the model
mse_rf = mean_squared_error(y_test, y_pred_rf)
mae_rf = mean_absolute_error(y_test, y_pred_rf)
r2_rf = r2_score(y_test, y_pred_rf)
# Display the evaluation metrics
print("Random Forest Regressor Evaluation Metrics:")
print("Mean Squared Error (MSE):", mse_rf)
print("Mean Absolute Error (MAE):", mae_rf)
print("R² Score:", r2_rf)
# Optionally, visualize the predictions vs. actual values
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred_rf, alpha=0.7)
plt.plot([y.min(), y.max()], [y.min(), y.max()], 'k--', lw=2)
plt.xlabel('Actual Values')
plt.ylabel('Predicted Values')
plt.title('Random Forest: Actual vs. Predicted Values')
plt.show()
Random Forest Regressor Evaluation Metrics: Mean Squared Error (MSE): 0.0 Mean Absolute Error (MAE): 0.0 R² Score: 1.0
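A test-set MSE of 0.0 and R² of 1.0 is rarely the sign of a good model; more often it points to duplicated rows shared between the training and test splits, or a feature that effectively leaks the target. A quick overlap check can be sketched as follows; the helper name is ours, and the small synthetic frames stand in for the notebook's `X_train`/`X_test`:

```python
import pandas as pd

def train_test_overlap(X_train: pd.DataFrame, X_test: pd.DataFrame) -> int:
    """Count test rows that also appear verbatim in the training set."""
    # Inner merge on all shared columns finds exact duplicate rows.
    return len(X_test.merge(X_train.drop_duplicates(), how="inner"))

# Synthetic demonstration: one of the two test rows duplicates a training row.
X_tr = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
X_te = pd.DataFrame({"a": [1, 9], "b": [4, 9]})
print(train_test_overlap(X_tr, X_te))  # 1
```

A nonzero overlap on the real split, or a feature that is closely tied to the target, would explain a perfect score like the one above.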
Our spatial analysis approach employs a comprehensive set of techniques to analyze and visualize geographical data, particularly focusing on address distribution patterns. We begin with exploratory data visualization using heatmaps and 3D scatter plots to understand the overall density and distribution of addresses. To identify clusters and patterns, we implement DBSCAN clustering with convex hulls, providing insights into the spatial grouping of addresses. We then utilize spatial joins to relate our data points to specific landmarks, offering context to the geographical layout. For a more nuanced understanding of density, we employ Kernel Density Estimation (KDE) plots and hexbin plots. To assess the spatial randomness and clustering of our data, we conduct Ripley's K Function analysis. This multi-faceted approach allows us to examine the spatial characteristics of our address data from various perspectives, enabling a thorough understanding of geographical patterns, clusters, and distributions within the dataset.
This comprehensive approach provides a robust framework for analyzing and interpreting the spatial patterns in the address data.
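The DBSCAN-with-convex-hulls step mentioned in the approach above can be sketched as follows. This is a minimal illustration on synthetic coordinates; the `eps` and `min_samples` values are chosen for the toy data, not tuned for the real addresses:

```python
import numpy as np
from sklearn.cluster import DBSCAN
from scipy.spatial import ConvexHull

# Synthetic lon/lat points: two tight groups plus one isolated outlier.
rng = np.random.default_rng(42)
pts = np.vstack([
    rng.normal([144.96, -37.81], 0.001, size=(50, 2)),
    rng.normal([144.99, -37.80], 0.001, size=(50, 2)),
    [[145.10, -37.70]],  # noise point far from both groups
])

# NB: eps is in degrees here; for a cutoff in metres, project to a metric CRS first.
labels = DBSCAN(eps=0.005, min_samples=5).fit_predict(pts)
print(sorted(set(labels)))  # typically [-1, 0, 1]: noise label plus two clusters

# A convex hull around each cluster outlines its spatial footprint.
for lab in sorted(set(labels) - {-1}):
    hull = ConvexHull(pts[labels == lab])
    # For 2-D input, ConvexHull.volume is the enclosed area (and .area the perimeter).
    print(f"cluster {lab}: hull area {hull.volume:.2e}")
```

The hull vertices (`hull.vertices`) can be drawn as polygons over the base map to show each cluster's extent.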
# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(
street_add,
geometry=gpd.points_from_xy(street_add.longitude, street_add.latitude),
crs="EPSG:4326"
)
The heatmap visualization provides a clear and striking insight into the geographical distribution of our data points. The map reveals a highly concentrated cluster of addresses centered around Melbourne, Australia. This intense concentration, depicted by the vibrant red and yellow colors, indicates a significant density of addresses in the Melbourne metropolitan area. The heatmap's circular pattern, transitioning from red at the center to cooler colors (orange, yellow, green, and blue) at the edges, suggests a gradual decrease in address density as we move away from the city center. This pattern is characteristic of a major urban area, with the highest concentration of addresses in the central business district and inner suburbs, and a progressive thinning out towards the outskirts. The stark contrast between the dense cluster and the surrounding areas highlights Melbourne's role as a major population center in the region, standing out prominently against the less populated areas of Victoria state.
# Heatmap
heat_data = [[row.latitude, row.longitude] for idx, row in gdf.iterrows()]
heat_map = folium.Map(location=[gdf.latitude.mean(), gdf.longitude.mean()], zoom_start=12)
HeatMap(heat_data).add_to(heat_map)
display(heat_map)
The 3D scatter plot and spatial join analysis provide rich insights into the spatial distribution of addresses in Melbourne. The plot visualizes the city's suburbs as distinct clusters, each represented by a unique color, showcasing the geographical layout and relative positions of areas like North Melbourne, Southbank, and Docklands. The z-axis encodes the suburb ID, a categorical identifier rather than elevation, which separates overlapping suburbs vertically and makes neighborhood boundaries easier to distinguish. Notably, the spatial join reveals a significant concentration of addresses in the city center, with 63,721 points located within 5 km of Melbourne's CBD. This high density underscores the centralized nature of Melbourne's urban development, highlighting the CBD as a focal point of residential and commercial activity. Together, the visualization and the quantitative result paint a picture of a densely populated, well-defined urban core surrounded by distinct suburban areas, each with its own spatial characteristics and relationship to the city center.
# 3D scatter plot
fig = px.scatter_3d(gdf, x='longitude', y='latitude', z='suburb_id', color='suburb',
hover_data=['address_pnt'])
fig.show()
# Spatial join (example: finding points within a certain distance of a landmark)
landmark = gpd.GeoDataFrame({'geometry': [Point(144.9631, -37.8136)]}, crs="EPSG:4326") # Melbourne CBD
# NB: in EPSG:4326 max_distance is measured in degrees, not metres; reproject both
# layers to a metric CRS (e.g. EPSG:3857) before filtering for a true 5 km cutoff.
nearby_points = gpd.sjoin_nearest(gdf, landmark, max_distance=5000)  # Points within 5km
print(f"Number of points within 5km of Melbourne CBD: {len(nearby_points)}")
Number of points within 5km of Melbourne CBD: 63721
The Kernel Density Estimation (KDE) plots for business distribution in 2010, 2015, and 2020 reveal a concentration of businesses within a specific geographic region, as indicated by the high-density areas (red zones) in the plots. Across these years, the core area of business activity appears consistent, with only slight changes in density and spatial spread. The KDE plots demonstrate that while the overall distribution of businesses has remained stable, there may be a subtle increase in density within the central region, indicating potential business growth or clustering in this area over the decade examined. The KDE method effectively highlights the geographic concentration of businesses and any temporal changes in their distribution.
# Convert to GeoDataFrame
gdf = gpd.GeoDataFrame(cleaned_df, geometry=gpd.points_from_xy(cleaned_df['longitude'], cleaned_df['latitude']), crs="EPSG:4326")
# Filter data for specific years (for example, 2010, 2015, 2020)
years_to_analyze = [2010, 2015, 2020]
gdf_filtered = gdf[gdf['census_year'].isin(years_to_analyze)]
# Plot KDE for each year
for year in years_to_analyze:
    fig, ax = plt.subplots(figsize=(12, 8))
    subset = gdf_filtered[gdf_filtered['census_year'] == year]
    # Reproject to Web Mercator for contextily
    subset = subset.to_crs(epsg=3857)
    # Extract x and y coordinates from the geometry
    subset['x'] = subset.geometry.x
    subset['y'] = subset.geometry.y
    # Plot KDE (fill= replaces the deprecated shade= argument in recent seaborn)
    sns.kdeplot(data=subset, x='x', y='y', cmap='YlOrRd', fill=True, cbar=True, ax=ax)
    # Add basemap
    ctx.add_basemap(ax, source=ctx.providers.CartoDB.Positron)
    # Set extent to match the data
    ax.set_xlim(subset.total_bounds[[0, 2]])
    ax.set_ylim(subset.total_bounds[[1, 3]])
    # Remove axis labels as they're not meaningful in this projection
    ax.set_axis_off()
    plt.title(f'Kernel Density Estimation of Business Distribution in {year}')
    plt.tight_layout()
    plt.show()
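The hexbin view mentioned in the approach above complements the KDE plots: it bins points into hexagonal cells and keeps exact counts per cell. A minimal sketch, using synthetic coordinates standing in for the longitude/latitude columns:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt

# Synthetic stand-in for the longitude/latitude columns of the GeoDataFrame.
rng = np.random.default_rng(0)
lon = rng.normal(144.96, 0.02, 5000)
lat = rng.normal(-37.81, 0.02, 5000)

fig, ax = plt.subplots(figsize=(8, 6))
# mincnt=1 hides empty cells; gridsize controls the hexagon resolution.
hb = ax.hexbin(lon, lat, gridsize=40, cmap="YlOrRd", mincnt=1)
fig.colorbar(hb, ax=ax, label="Point count per hexagon")
ax.set_xlabel("Longitude")
ax.set_ylabel("Latitude")
ax.set_title("Hexbin density of addresses")
fig.savefig("hexbin_density.png")

# Unlike a KDE, the per-cell counts are available for downstream analysis.
counts = hb.get_array()
print(int(counts.max()))
```

Because the counts are exact rather than smoothed, hexbins are useful for reading off how many businesses fall in the densest cells.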
The Ripley's K function analysis compares the observed spatial distribution of addresses (blue line) with the expectation under Complete Spatial Randomness (CSR, orange line). Under the standard interpretation, observed K values above the CSR curve indicate clustering at that distance, while values below it indicate dispersion. Here the observed curve sits well below the CSR expectation across the full range of distances examined, which on a strict reading indicates a strongly non-random pattern; however, the simplified estimate used below (a rectangular study area and no edge correction) rescales the observed curve, so the absolute comparison with CSR should be treated with caution. Clustering at short distances would be consistent with typical urban development, where buildings and addresses concentrate in specific areas rather than spreading evenly across the landscape.
# Ripley's K Function Analysis
def ripley_k(points, distances, area):
    n = len(points)
    k_values = []
    for d in distances:
        pairs = distance.cdist(points, points) < d
        # Subtract the n self-pairs on the diagonal, which are always within any d
        k = (area * (np.sum(pairs) - n)) / (n * (n - 1))
        k_values.append(k)
    return k_values
# Sample 1000 points for faster computation
sample_size = min(1000, len(gdf))
sample_gdf = gdf.sample(sample_size)
points = np.column_stack((sample_gdf.geometry.x, sample_gdf.geometry.y))
# Bounding-box area is width * height, not the product of the max coordinates
minx, miny, maxx, maxy = gdf.total_bounds
area = (maxx - minx) * (maxy - miny)
distances = np.linspace(0, 0.1, 50) # adjust based on your data's scale
k_observed = ripley_k(points, distances, area)
k_expected = np.pi * distances**2
plt.figure(figsize=(10, 6))
plt.plot(distances, k_observed, label='Observed K')
plt.plot(distances, k_expected, label='Expected K (CSR)')
plt.xlabel('Distance')
plt.ylabel("Ripley's K")
plt.title("Ripley's K Function Analysis")
plt.legend()
plt.show()
In this time series analysis, our goal is to model and forecast the total number of jobs from 2003 to 2021 using various statistical and machine learning techniques. By leveraging this historical job data, we aim to uncover patterns, trends, and any seasonal behaviors that can help us better understand the employment dynamics over the years. We will explore three key models: ARIMA (AutoRegressive Integrated Moving Average), Holt-Winters Exponential Smoothing, and the Vector Autoregression (VAR) model. Each model takes a different approach to time series forecasting, allowing us to compare their effectiveness in predicting future trends.
The purpose of this analysis is to evaluate which model provides the best fit to our historical data and make accurate predictions for the next five years, covering 2022 to 2026. We will assess the performance of each model using evaluation metrics such as Mean Squared Error (MSE), Root Mean Squared Error (RMSE), Mean Absolute Error (MAE), and R-squared. By generating forecasts based on the fitted models, we aim to provide valuable insights that can inform economic planning and resource allocation, offering a data-driven outlook on future employment trends. This comparative analysis will help us identify the most reliable model for predicting job growth in the coming years.
cleaned_df['census_year'].value_counts()
2021    19912
2020    19748
2019    19725
2018    19722
2022    19660
2017    19570
2015    18999
2014    18908
2012    18770
2011    18677
2013    18553
2010    18300
2009    17991
2016    17715
2008    17400
2007    15788
2006    14891
2005    14257
2004    13750
2002    13497
2003    13496
Name: census_year, dtype: int64
cleaned_df.info() # this the cleaned dataset
<class 'pandas.core.frame.DataFrame'>
Int64Index: 369329 entries, 0 to 369328
Data columns (total 19 columns):
 #   Column                        Non-Null Count   Dtype
---  ------                        --------------   -----
 0   census_year                   369329 non-null  int64
 1   block_id                      369329 non-null  int64
 2   property_id                   369329 non-null  int64
 3   base_property_id              369329 non-null  int64
 4   clue_small_area               369329 non-null  object
 5   trading_name                  369329 non-null  object
 6   business_address              369329 non-null  object
 7   industry_anzsic4_code         369329 non-null  int64
 8   industry_anzsic4_description  369329 non-null  object
 9   longitude                     369329 non-null  float64
 10  latitude                      369329 non-null  float64
 11  anzsic_indusrty               369329 non-null  object
 12  clue_industry                 369329 non-null  object
 13  business_size                 369329 non-null  object
 14  total_establishments          369329 non-null  int64
 15  total_jobs                    369329 non-null  float64
 16  business_size_numeric         369329 non-null  int64
 17  clue_small_area_numeric       369329 non-null  int64
 18  clue_industry_numeric         369329 non-null  int64
dtypes: float64(3), int64(9), object(7)
memory usage: 56.4+ MB
Here we plot the raw yearly totals to examine the overall trends. The plot shows the total number of jobs from 2003 to 2021, highlighting several key movements. After a period of slow growth from 2003 to 2005, there is a steady rise in total jobs until around 2012, marking a strong phase of job expansion. This is followed by a period of fluctuation, with a slight decline around 2013-2016 and a more significant drop in 2017-2018. The labor market then recovers, showing stabilization from 2019 to 2021, though with a minor decline towards the end of the period, possibly indicating early signs of economic challenges as 2022 approaches.
# Load and prepare data
yearly_data = cleaned_df.groupby('census_year')['total_jobs'].sum().reset_index()
yearly_data.set_index('census_year', inplace=True)
# Add some features for VAR model
yearly_data['trend'] = np.arange(len(yearly_data))
yearly_data['total_jobs_diff'] = yearly_data['total_jobs'].diff()
# Plot the time series
plt.figure(figsize=(15, 7))
plt.plot(yearly_data.index, yearly_data['total_jobs'])
plt.title('Total Jobs Over Time')
plt.xlabel('Year')
plt.ylabel('Total Jobs')
plt.xticks(yearly_data.index, rotation=45)
plt.tight_layout()
plt.show()
The Augmented Dickey-Fuller (ADF) test results below assess the stationarity of both the original and differenced job series. For the original data, the ADF statistic is -3.99 with a p-value of about 0.0015, well below the 0.05 significance level, and the statistic is more negative than the 5% critical value (-3.15). We can therefore reject the null hypothesis of non-stationarity at the 5% level, implying the original series is already stationary, which is somewhat unusual for a raw time series.
For the differenced data, by contrast, the ADF statistic is 18.59 with a p-value of 1.0, so the null hypothesis of non-stationarity cannot be rejected at all. This counterintuitive outcome, with differencing appearing to make matters worse, most likely reflects the very short series (around 20 annual observations), which leaves the ADF test with little power; results on either series should be interpreted cautiously at this sample size.
# Check for stationarity
def test_stationarity(timeseries):
    result = adfuller(timeseries, autolag='AIC')
    print('ADF Statistic:', result[0])
    print('p-value:', result[1])
    print('Critical Values:', result[4])
print("Stationarity Test for Original Data:")
test_stationarity(yearly_data['total_jobs'])
# If data is not stationary, difference it
yearly_data['total_jobs_diff'] = yearly_data['total_jobs'].diff()
yearly_data = yearly_data.dropna()
print("\nStationarity Test for Differenced Data:")
test_stationarity(yearly_data['total_jobs_diff'])
Stationarity Test for Original Data:
ADF Statistic: -3.987381506113443
p-value: 0.001478045945905543
Critical Values: {'1%': -4.137829282407408, '5%': -3.1549724074074077, '10%': -2.7144769444444443}
Stationarity Test for Differenced Data:
ADF Statistic: 18.585124912652123
p-value: 1.0
Critical Values: {'1%': -4.223238279489106, '5%': -3.189368925619835, '10%': -2.729839421487603}
In this time series analysis of total jobs from 2003 to 2021, the ACF and PACF plots provide insight into the underlying patterns in the data. The ACF plot shows a gradual decline in autocorrelation across increasing lags, suggesting that the data may exhibit non-stationarity or an autoregressive structure with persistence up to around 4 lags. The PACF plot, on the other hand, exhibits a sharp cutoff after lag 1, indicating that an AR(1) model might capture most of the dependencies in the data. This pattern suggests that the data could be well-represented by a low-order autoregressive model, and further differencing or model refinement may be needed to address any potential non-stationarity. These insights will inform the choice of models such as ARIMA in the forecasting process.
# Calculate and plot ACF
acf_values = acf(yearly_data['total_jobs'])
plt.figure(figsize=(12, 6))
plt.bar(range(len(acf_values)), acf_values)
plt.title('Autocorrelation Function')
plt.xlabel('Lag')
plt.ylabel('ACF')
plt.show()
# Calculate and plot PACF
pacf_values = pacf(yearly_data['total_jobs'])
plt.figure(figsize=(12, 6))
plt.bar(range(len(pacf_values)), pacf_values)
plt.title('Partial Autocorrelation Function')
plt.xlabel('Lag')
plt.ylabel('PACF')
plt.show()
The ARIMA(0, 2, 0) model, selected for the lowest AIC (528.808), is a very simple specification with two levels of differencing and no autoregressive or moving-average terms. The large sigma-squared estimate (2.69e+11) reflects high residual variance. The Ljung-Box test (Q = 3.33, p = 0.07) does not show significant residual autocorrelation at the 5% level, and the Jarque-Bera test (JB = 4.04, p = 0.13) does not reject normality of the residuals, although the positive skewness (0.90) and elevated kurtosis (4.46) hint at mild departures from normality. Given the short series, these diagnostics should be read with caution, and the model may still benefit from refinement.
# Function to evaluate models
def evaluate_model(actual, predicted):
    mse = mean_squared_error(actual, predicted)
    rmse = sqrt(mse)
    mae = mean_absolute_error(actual, predicted)
    r2 = r2_score(actual, predicted)
    return {'MSE': mse, 'RMSE': rmse, 'MAE': mae, 'R-squared': r2}
# Auto ARIMA to find best parameters
auto_arima_model = auto_arima(yearly_data['total_jobs'], start_p=1, start_q=1,
                              test='adf', max_p=5, max_q=5, m=1,
                              d=None, seasonal=False, start_P=0,
                              D=0, trace=True, error_action='ignore',
                              suppress_warnings=True, stepwise=True)
print(f"Best ARIMA order: {auto_arima_model.order}")
# ARIMA model with best parameters
arima_model = ARIMA(yearly_data['total_jobs'], order=auto_arima_model.order)
arima_results = arima_model.fit()
print(arima_results.summary())
Performing stepwise search to minimize aic
ARIMA(1,2,1)(0,0,0)[0] intercept : AIC=535.194, Time=0.06 sec
ARIMA(0,2,0)(0,0,0)[0] intercept : AIC=530.580, Time=0.00 sec
ARIMA(1,2,0)(0,0,0)[0] intercept : AIC=532.122, Time=0.01 sec
ARIMA(0,2,1)(0,0,0)[0] intercept : AIC=533.141, Time=0.01 sec
ARIMA(0,2,0)(0,0,0)[0] : AIC=528.808, Time=0.00 sec
Best model: ARIMA(0,2,0)(0,0,0)[0]
Total fit time: 0.092 seconds
Best ARIMA order: (0, 2, 0)
SARIMAX Results
==============================================================================
Dep. Variable: total_jobs No. Observations: 20
Model: ARIMA(0, 2, 0) Log Likelihood -263.404
Date: Wed, 18 Sep 2024 AIC 528.808
Time: 01:01:15 BIC 529.699
Sample: 0 HQIC 528.931
- 20
Covariance Type: opg
==============================================================================
coef std err z P>|z| [0.025 0.975]
------------------------------------------------------------------------------
sigma2 2.69e+11 6.43e+10 4.187 0.000 1.43e+11 3.95e+11
===================================================================================
Ljung-Box (L1) (Q): 3.33 Jarque-Bera (JB): 4.04
Prob(Q): 0.07 Prob(JB): 0.13
Heteroskedasticity (H): 2.73 Skew: 0.90
Prob(H) (two-sided): 0.25 Kurtosis: 4.46
===================================================================================
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
The residual diagnostics of the ARIMA model show several concerns. The top left plot indicates non-constant residuals with a few extreme spikes, suggesting model misfit. The histogram (top right) shows some skewness and deviation from the normal distribution, while the Q-Q plot (bottom left) reveals that residuals deviate from the theoretical normal line, especially at the extremes. Lastly, the correlogram (bottom right) shows some autocorrelation in the residuals, meaning the ARIMA model fails to capture all the time-dependent structure in the data. Overall, the ARIMA model doesn't adequately fit this dataset.
arima_results.plot_diagnostics(figsize=(15, 12))
plt.show()
The Holt-Winters Exponential Smoothing model with an additive trend was fit to the job data, giving an AIC of 527.463, marginally better than the ARIMA model's 528.808. The smoothing level (alpha ≈ 0.995) means the forecasts are driven almost entirely by the most recent observations, while the smoothing trend (beta ≈ 0.284) updates the trend component more gradually. The initial level (about 1.24e+07) and initial trend (about 6.25e+05) give the series' starting point and growth rate. The model captures both the level and trend of the job data, but forecast-accuracy metrics are still needed to evaluate it properly.
# Holt-Winters model
hw_model = ExponentialSmoothing(yearly_data['total_jobs'], trend='add', seasonal=None)
hw_results = hw_model.fit()
print(hw_results.summary())
ExponentialSmoothing Model Results
================================================================================
Dep. Variable: total_jobs No. Observations: 20
Model: ExponentialSmoothing SSE 3810976327383.089
Optimized: True AIC 527.463
Trend: Additive BIC 531.446
Seasonal: None AICC 533.925
Seasonal Periods: None Date: Wed, 18 Sep 2024
Box-Cox: False Time: 01:01:15
Box-Cox Coeff.: None
==============================================================================
coeff code optimized
------------------------------------------------------------------------------
smoothing_level 0.9950000 alpha True
smoothing_trend 0.2842857 beta True
initial_level 1.2433e+07 l.0 True
initial_trend 6.2522e+05 b.0 True
------------------------------------------------------------------------------
Next, the Vector Autoregression (VAR) model is applied, and its results are evaluated below:
# VAR model
var_data = yearly_data[['total_jobs', 'trend', 'total_jobs_diff']].dropna()
var_model = VAR(var_data)
The evaluation results of the three forecasting models, VAR, ARIMA, and Holt-Winters, reveal important insights into their performance on the job data.
The VAR (Vector Autoregression) model shows the strongest performance overall, with a Mean Squared Error (MSE) of about 1.28e+11, a Root Mean Squared Error (RMSE) of roughly 357,000, and a Mean Absolute Error (MAE) of about 261,000. Most notably, its R-squared of 0.956 indicates that the model explains about 96% of the variance in the data, suggesting that VAR captures the underlying structure of the job trends very effectively.
The ARIMA model, on the other hand, performs poorly, with a far higher MSE of about 1.63e+13 and an RMSE of roughly 4.04 million. The MAE is also much larger at about 1.63 million, and the R-squared is negative (-3.36), meaning the model fits the data worse than simply predicting the mean would. This suggests that ARIMA struggles to capture the dynamics of this particular series, potentially due to an inadequate differencing order or the very short sample.
The Holt-Winters Exponential Smoothing model performs well, with an MSE of about 1.91e+11 and an RMSE of roughly 437,000. Its MAE of about 355,000 is higher than VAR's, and its R-squared of 0.949 (about 95% of the variance explained) is slightly below VAR's. The Holt-Winters model, which captures the trend without seasonal adjustments, provides solid fits, though slightly less accurate than VAR.
In conclusion, while the ARIMA model is not suitable for this dataset, both the VAR and Holt-Winters models fit well, with VAR slightly ahead. Note that these metrics are computed on in-sample fitted values rather than held-out forecasts, so they measure goodness of fit rather than true predictive accuracy; with that caveat, VAR is the most reliable model for forecasting future job trends from this data.
# Calculate the maximum number of lags
n_obs = len(var_data)
n_vars = var_data.shape[1]
max_lags = min(15, int((n_obs - n_vars - 1) / (n_vars * n_vars + n_vars)))
if max_lags > 0:
    var_results = var_model.fit(maxlags=max_lags, ic='aic')
    var_fittedvalues = var_results.fittedvalues['total_jobs']
    # Align the fitted values with the original data
    var_fittedvalues = var_fittedvalues.reindex(var_data.index)
    var_fittedvalues.iloc[:max_lags] = np.nan  # Set initial values to NaN
    # Evaluate only on the non-NaN values
    var_evaluation = evaluate_model(var_data['total_jobs'].iloc[max_lags:], var_fittedvalues.iloc[max_lags:])
    print("\nVAR Model Evaluation:")
    print(var_evaluation)
else:
    print("\nNot enough data for VAR model. Skipping VAR analysis.")
# Evaluate models
arima_evaluation = evaluate_model(yearly_data['total_jobs'], arima_results.fittedvalues)
hw_evaluation = evaluate_model(yearly_data['total_jobs'], hw_results.fittedvalues)
print("\nARIMA Model Evaluation:")
print(arima_evaluation)
print("\nHolt-Winters Model Evaluation:")
print(hw_evaluation)
VAR Model Evaluation:
{'MSE': 127722520263.88837, 'RMSE': 357382.87628800626, 'MAE': 261020.6181255932, 'R-squared': 0.9559655850224672}
ARIMA Model Evaluation:
{'MSE': 16325908112939.863, 'RMSE': 4040533.147115596, 'MAE': 1626990.5999941952, 'R-squared': -3.3638093238933333}
Holt-Winters Model Evaluation:
{'MSE': 190548816369.15448, 'RMSE': 436518.9759554039, 'MAE': 354872.5046318023, 'R-squared': 0.9490675375742501}
The graph compares the performance of three models—ARIMA, Holt-Winters, and VAR—used to forecast total jobs from 2022 to 2026, based on historical data from 2002 to 2021. The VAR model (in red) and Holt-Winters model (in green) show upward trends, with VAR providing a slightly steeper increase in job growth. Both of these models fit well with historical data, as suggested by their performance metrics (such as high R-squared values). However, the ARIMA model (in orange) forecasts a declining trend, indicating poor fit to the data, which is also evident from its evaluation metrics.
Thus, the VAR model appears to provide the most accurate predictions, followed closely by Holt-Winters, while ARIMA performs poorly.
# Forecast next 5 years
future_years = range(yearly_data.index[-1] + 1, yearly_data.index[-1] + 6)
arima_forecast = arima_results.forecast(steps=5)
hw_forecast = hw_results.forecast(5)
if max_lags > 0:
    lag_order = var_results.k_ar
    var_forecast_input = var_data.values[-lag_order:]
    var_forecast = var_results.forecast(y=var_forecast_input, steps=5)
    var_forecast = pd.DataFrame(var_forecast, columns=var_data.columns)['total_jobs']
else:
    var_forecast = None
# Plot forecasts
plt.figure(figsize=(15, 7))
plt.plot(yearly_data.index, yearly_data['total_jobs'], label='Historical Data')
plt.plot(future_years, arima_forecast, label='ARIMA Forecast')
plt.plot(future_years, hw_forecast, label='Holt-Winters Forecast')
if var_forecast is not None:
    plt.plot(future_years, var_forecast, label='VAR Forecast')
plt.title('Total Jobs Forecast')
plt.xlabel('Year')
plt.ylabel('Total Jobs')
plt.legend()
plt.xticks(list(yearly_data.index) + list(future_years), rotation=45)
plt.tight_layout()
plt.show()
print("\nARIMA Forecast for next 5 years:")
print(arima_forecast)
print("\nHolt-Winters Forecast for next 5 years:")
print(hw_forecast)
if var_forecast is not None:
    print("\nVAR Forecast for next 5 years:")
    print(var_forecast)
else:
    print("\nVAR Forecast not available due to insufficient data")
ARIMA Forecast for next 5 years:
20    18575401.0
21    18246531.0
22    17917661.0
23    17588791.0
24    17259921.0
Name: predicted_mean, dtype: float64

Holt-Winters Forecast for next 5 years:
20    1.892948e+07
21    1.895224e+07
22    1.897500e+07
23    1.899775e+07
24    1.902051e+07
dtype: float64

VAR Forecast for next 5 years:
0    1.894285e+07
1    1.905585e+07
2    1.917729e+07
3    1.929376e+07
4    1.940369e+07
Name: total_jobs, dtype: float64
In this time series analysis, we aimed to forecast total job numbers from 2022 to 2026 using historical data from 2002 to 2021, employing three key models: ARIMA, Holt-Winters Exponential Smoothing, and Vector Autoregression (VAR). The goal was to identify patterns, trends, and potential seasonal behaviors to make accurate predictions and offer valuable insights for future economic planning and resource allocation.
Model Performance Summary:
ARIMA Model:
The ARIMA model performed poorly, showing a declining trend in job numbers that contrasts sharply with historical data.
Residual analysis indicated serious issues. The residuals are non-normal, with significant deviations from zero in the correlogram, meaning that the model leaves unexplained autocorrelation. The Q-Q plot shows that residuals deviate from normality, especially at the tails, further suggesting poor model fit.
The ARIMA model is not well-suited to this dataset. It lacks the ability to capture the dynamics of job trends, possibly due to inadequate differencing or a missing seasonal component. Its high error metrics and negative R-squared value further emphasize its inefficiency.
Holt-Winters Exponential Smoothing Model:
The Holt-Winters model provided moderately good forecasts with an upward trend, indicating that it can capture the overall job growth. The model does not explicitly include seasonality, but it still performs well, as the job data shows no strong seasonal components.
The Holt-Winters model performed well with low error metrics and a high R-squared value, suggesting it can capture the trend component of the data effectively. However, it is slightly less accurate than the VAR model.
VAR (Vector Autoregression) Model:
The VAR model produced the most reliable forecasts, indicating steady job growth. It slightly outperformed Holt-Winters in terms of capturing the underlying trends and making accurate predictions.
The model fits the historical data closely, and its error metrics (MSE, RMSE, MAE) are the lowest among the models.
The VAR model emerged as the best fit for this dataset. Its ability to capture relationships across multiple time series (i.e., variables influencing job trends) contributed to its superior performance. The high R-squared value (about 0.96) indicates that the model explains roughly 96% of the variance in the data, making it the most reliable of the three for forecasting.
Job Trends: Both the VAR and Holt-Winters models predict sustained job growth from 2022 to 2026. The ARIMA model's prediction of a decline does not align with historical trends, further casting doubt on its suitability.
The VAR model is the most accurate, followed closely by Holt-Winters. ARIMA, with its poor performance metrics, is not recommended for future forecasting of job trends in this context.
Economic Planning:
Based on the VAR and Holt-Winters forecasts, policymakers can expect continued job growth, which supports optimistic economic planning. However, close monitoring of external factors like economic policies and global conditions is essential to ensure the forecasts remain valid.
Further Improvements:
Future analyses may consider exploring additional variables that could impact job trends, such as inflation, technological advancements, or regional differences. Refining the seasonal components and considering exogenous factors in VAR models could also improve forecast accuracy.
In conclusion, the VAR model stands out as the most reliable for predicting future job trends in this dataset, offering valuable insights for policymakers and stakeholders.